The re module makes the Python language have all the regular expression functions.
Syntax that will be used
Regular characters | Interpretation | Example |
---|---|---|
+ | The previous element appears at least once | ab+: ab, abbbb, etc. |
* | The preceding element appears 0 or more times | ab*: a, ab, abb, etc. |
? | Match the previous one or 0 times | Ab?: A, Ab, etc. |
^ | As the start tag | ^a: abc, aaaaaa, etc. |
$ | as the end tag | c$: abc, cccc, etc. |
\ d | Number | 3, 4, 9 and other regular character interpretation examples + the preceding element appears at least once ab+: ab, abbbb, etc. The preceding element appears 0 or more times ab: a, ab, abb, etc.? Match the preceding one or 0 Ab?: A, Ab, etc.^ as the start tag ^a: abc, aaaaaa, etc. $ as the end tag c$: abc, cccc, etc. \d number 3, 4, 9 etc.\D non-number A, a,-etc. [az]Any letter a, p, m, etc. between A and z [0-9] Any number 0, 2, 9, etc. between 0 and 9 |
Regular characters | Interpretation | Example |
+ | The previous element appears at least once | ab+: ab, abbbb, etc. |
* | The preceding element appears 0 or more times | ab*: a, ab, abb, etc. |
? | Match the previous one or 0 times | Ab?: A, Ab, etc. |
^ | As the start tag | ^a: abc, aaaaaa, etc. |
$ | as the end tag | c$: abc, cccc, etc. |
\ d | Numbers | 3, 4, 9, etc. |
\ D | Non-number | A, a,-etc. |
[ az] | Any letter between A and z | a, p, m, etc. |
[0- 9] | 0 Any number from to 9 | 0, 2, 9, etc. |
\ D | Non-number | A, a,-etc. |
[ az] | Any letter between A and z | a, p, m, etc. |
[0- 9] | 0 Any number from to 9 | 0, 2, 9, etc. |
note:
s
'( abc)def'
m = re.search("(\(.*\)).*", s)
print m.group(1)(abc)
re.match function
re.match tries to match a pattern from the beginning of the string. If the match is not successful at the beginning, match() returns none.
Example 1:
#! /usr/bin/python
# - *- coding: UTF-8-*-import re
print(re.match('www','www.zalou.cn').span()) #Match at the start
print(re.match('net','www.zalou.cn')) #Does not match at the beginning
Output result:
(0, 3)
None
Example 2:
#! /usr/bin/python
import re
line ="Cats are smarter than dogs"
matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)if matchObj:
print "matchObj.group() : ", matchObj.group()
print "matchObj.group(1) : ", matchObj.group(1)
print "matchObj.group(2) : ", matchObj.group(2)else:
print "No match!!"
Output result:
matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter
The above is the print output of python2. Just remember to add () in python, and the output of python is similar to the \n in other languages to match the obtained content.
python group()
In regular expressions, group() is used to propose the string intercepted by grouping, and () is used to group
Repeat the previous string multiple times
a ="kdlal123dk345"
b ="kdlal123345"
m = re.search("([0-9]+(dk){0,1})[0-9]+", a)
m.group(1), m.group(2)('123dk','dk')
m = re.search("([0-9]+(dk){0,1})[0-9]+", b)
m.group(1)'12334'
m.group(2)
The reason
group() is the same as group(0) to match the overall result of the regular expression
group(1) lists the first bracket matching part, group(2) lists the second bracket matching part, group(3) lists the third bracket matching part.
2. If there is no successful match, re.search() returns None
Example
# - *- coding: cp936 -*-import re
s1 ='adkkdk'
s2 ='abc123efg'
an = re.search('^[a-z]+$', s1)if an:
print 's1:', an.group(),'All lowercase'else:
print s1,"Not all lowercase!"
an = re.match('[a-z]+$', s2)if an:
print 's2:', an.group(),'All lowercase'else:
print s2,"Not all lowercase!"
result
The reason
Regular expressions are not part of python, you need to quote the re module when using
The matching form is: re.search (regular expression, with matching string) or re.match (regular expression, with matching string). The difference between the two is that the latter starts with a start character (^) by default. therefore,
re.search('[1]+′,s1) is equivalent to re.match('[a−z]+', s2)
group is used to group matching results
E.g
import re
a ="123abc456"
print re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(0) #123abc456,Back to the whole
print re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(1) #123
print re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(2) #abc
print re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(3) #456
Output result
123 abc456
123
abc
456
1 ) The three sets of parentheses in the regular expression divide the matching results into three groups
group() is the same as group(0) to match the overall result of the regular expression
group(1) lists the first bracket matching part, group(2) lists the second bracket matching part, group(3) lists the third bracket matching part.
2 ) If there is no match, re.search() returns None
3 ) Of course, there are no parentheses in the regular expression, and group(1) is definitely wrong.
2. Acronym expansion
Concrete example
FEMA Federal Emergency Management Agency
IRA Irish Republican Army
DUP Democratic Unionist Party
FDA Food and Drug Administration
OLC Office of Legal Counsel
analysis
Abbreviation FEMA
Decomposed into F*** E*** M*** A***
Regular uppercase letters + lowercase (more than or equal to 1) + spaces
Reference Code
import re
def expand_abbr(sen, abbr):
lenabbr =len(abbr)
ma =''for i inrange(0, lenabbr):
ma += abbr[i]+"[a-z]+"+' '
print 'ma:', ma
ma = ma.strip(' ')
p = re.search(ma, sen)if p:return p.group()else:return''
print expand_abbr("Welcome to Algriculture Bank China",'ABC')
result
problem
The above code is correct for the first 3 in the example, but the latter two are wrong, because the words beginning with capital letters are also mixed with lowercase letters.
law
Uppercase letter + lowercase (more than or equal to 1) + space + [lower case + space] (0次或1次)
Reference Code
import re
def expand_abbr(sen, abbr):
lenabbr =len(abbr)
ma =''for i inrange(0, lenabbr-1):
ma += abbr[i]+"[a-z]+"+' '+'([a-z]+ )?'
ma += abbr[lenabbr-1]+"[a-z]+"
print 'ma:', ma
ma = ma.strip(' ')
p = re.search(ma, sen)if p:return p.group()else:return''
print expand_abbr("Welcome to Algriculture Bank of China",'ABC')
skill
The set of lowercase letters in the middle + a space, as a whole, add parentheses. Either at the same time or not at the same time, so need to use it? , To match the whole ahead.
3. Remove the comma in the number
Concrete example
When dealing with natural language, if 123,000,000 is divided by punctuation, there will be a problem. A good number is dismembered by a comma, so you can deal with the number first (remove the comma).
analysis
The number is often a group of 3 numbers, followed by a comma, so the rule is: ***, ***, ***
Regular expression
[ a-z]+,[a-z]?
Reference code 3-1
import re
sen ="abc,123,456,789,mnp"
p = re.compile("\d+,\d+?")for com in p.finditer(sen):
mm = com.group()
print "hi:", mm
print "sen_before:", sen
sen = sen.replace(mm, mm.replace(",",""))
print "sen_back:", sen,'\n'
result
skill
Use the function finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):
Search for string and return an iterator that sequentially accesses each match result (Match object).
Reference code 3-2
sen ="abc,123,456,789,mnp"while1:
mm = re.search("\d,\d", sen)if mm:
mm = mm.group()
sen = sen.replace(mm, mm.replace(",",""))
print sen
else:break
result
extend
Such a program is aimed at a specific problem, that is, a group of 3 digits. If the numbers are mixed with letters, eliminate the commas between the numbers, that is, convert "abc,123,4,789,mnp" into "abc,1234789,mnp"
Ideas
More specifically, find the regular expression "number, number" and replace it with the comma removed
Reference code 3-3
sen ="abc,123,4,789,mnp"while1:
mm = re.search("\d,\d", sen)if mm:
mm = mm.group()
sen = sen.replace(mm, mm.replace(",",""))
print sen
else:break
print sen
result
4. Year conversion for Chinese processing (for example: 1949-1949)
Chinese processing involves coding issues. For example, when the program below recognizes the year (****year)
# - *- coding: cp936 -*-import re
m0 ="New China was founded in 1949"
m1 ="5.2% lower than in 1990"
m2 ='Man defeated the Russian army in 1996,Achieved substantial independence'
def fuc(m):
a = re.findall("[zero|One|two|three|four|Fives|six|Seven|Eight|nine]+year", m)if a:for key in a:
print key
else:
print "NULL"fuc(m0)fuc(m1)fuc(m2)
operation result
It can be seen that there are errors in the second and third.
Improvement-standardization into unicode recognition
# - *- coding: cp936 -*-import re
m0 ="New China was founded in 1949"
m1 ="5.2% lower than in 1990"
m2 ='Man defeated the Russian army in 1996,Achieved substantial independence'
def fuc(m):
m = m.decode('cp936')
a = re.findall(u"[\u96f6|\u4e00|\u4e8c|\u4e09|\u56db|\u4e94|\u516d|\u4e03|\u516b|\u4e5d]+\u5e74", m)if a:for key in a:
print key
else:
print "NULL"fuc(m0)fuc(m1)fuc(m2)
result
It can be recognized by replacing Chinese characters with numbers.
reference
numHash ={}
numHash['zero'.decode('utf-8')]='0'
numHash['One'.decode('utf-8')]='1'
numHash['two'.decode('utf-8')]='2'
numHash['three'.decode('utf-8')]='3'
numHash['four'.decode('utf-8')]='4'
numHash['Fives'.decode('utf-8')]='5'
numHash['six'.decode('utf-8')]='6'
numHash['Seven'.decode('utf-8')]='7'
numHash['Eight'.decode('utf-8')]='8'
numHash['nine'.decode('utf-8')]='9'
def change2num(words):
print "words:",words
newword =''for key in words:
print key
if key in numHash:
newword += numHash[key]else:
newword += key
return newword
def Chi2Num(line):
a = re.findall(u"[\u96f6|\u4e00|\u4e8c|\u4e09|\u56db|\u4e94|\u516d|\u4e03|\u516b|\u4e5d]+\u5e74", line)if a:
print "------"
print line
for words in a:
newwords =change2num(words)
print words
print newwords
line = line.replace(words, newwords)return line
5. Multiple mobile phone numbers, separated by |
For example:
Null value
12222222222
12222222222|12222222222
12222222222|12222222222|12222222444
expression
s = “[\d]{11}(\|[\d]{11})*|”
4. Recommended
Python regular expression guide
So far this article about python regular expression example code is introduced. For more relevant python regular expression example content, please search for previous articles on ZaLou.Cn or continue to browse related articles below. Hope you will support ZaLou.Cn more in the future!
Recommended Posts