One article to get regular expressions in Python

Detailed re module

This article explains regular expressions and Python's re module in detail.


Contents of this article

What is a regular expression

A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract substrings that meet a certain condition from a string.

Regular expression functions

By using regular expressions, you can:

  1. Test for a pattern within a string
    For example, you can test an input string to see whether it contains a phone number pattern or a credit card number pattern. This is called data validation.
  2. Replace text
    You can use a regular expression to identify specific text in a document, then delete it completely or replace it with other text.
  3. Extract substrings from a string based on pattern matching
    You can search for specific text in a document or an input field, for example when a crawler pulls the content you need out of a web page.
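The three uses above can be sketched with Python's re module. This is only a minimal illustration: the phone-number shape, the sample strings, and the variable names are assumptions made up for the example, not part of the original article.

```python
import re

# 1. Validation: a hypothetical pattern for numbers shaped like 555-1234.
phone_pattern = re.compile(r"\d{3}-\d{4}")
is_valid = phone_pattern.fullmatch("555-1234") is not None

# 2. Replacement: mask every digit in a piece of text.
masked = re.sub(r"\d", "*", "card 1234")

# 3. Extraction: pull the text between <title> tags out of a page fragment.
html = "<html><title>Regex demo</title></html>"
found = re.search(r"<title>(.*?)</title>", html)
title = found.group(1) if found else None

print(is_valid)  # True
print(masked)    # card ****
print(title)     # Regex demo
```

fullmatch is used for validation so that the whole input has to fit the pattern, not just a part of it.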

Metacharacters and their meaning

Commonly used metacharacters

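| Symbol | Meaning |
| --- | --- |
| `.` | Matches any character except a line break |
| `*` | Matches the preceding element 0 or more times |
| `?` | Matches the preceding element 0 or 1 times; placed after another quantifier, it switches that quantifier to non-greedy mode |
| `^` | Start position |
| `$` | End position |
| `\s` | Matches any whitespace character |
| `\S` | Matches any non-whitespace character |
| `\d` | Matches a digit |
| `\D` | Matches a non-digit |
| `\w` | Matches a word character: letters, digits, and the underscore |
| `\W` | Matches a non-word character: anything other than letters, digits, and the underscore |
| `[abcd]` | Matches any one of the characters a, b, c, d |
| `[^abcd]` | Matches any one character other than a, b, c, d |
| `+` | Matches the preceding element one or more times |
| `{n}` | Matches exactly n times |
| `{n,}` | Matches at least n times |
| `{n,m}` | Matches n to m times |
| `x\|y` | Matches x or y |
| `()` | Groups the enclosed expression and captures what it matches |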
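A short sketch of several of the common metacharacters in action, using re.findall and re.search. The sample strings are invented for illustration.

```python
import re

text = "Order 66 shipped on 2024-01-15"

digit_runs = re.findall(r"\d+", text)       # + : one or more digits
four_digits = re.findall(r"\d{4}", text)    # {n} : exactly four digits
starts_with = re.search(r"^Order", text) is not None  # ^ : start position
ends_with = re.search(r"15$", text) is not None       # $ : end position

optional_u = re.findall(r"colou?r", "color colour")   # ? : zero or one "u"
either = re.findall(r"cat|dog", "cat dog bird")       # | : alternation
not_digit_or_space = re.findall(r"[^\d\s]+", "a1 b2") # [^...] : negated set
words = re.findall(r"\w+", "snake_case 42!")          # \w includes "_"

print(digit_runs)          # ['66', '2024', '01', '15']
print(four_digits)         # ['2024']
print(optional_u)          # ['color', 'colour']
print(either)              # ['cat', 'dog']
print(not_digit_or_space)  # ['a', 'b']
print(words)               # ['snake_case', '42']
```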

Metacharacters

The following is a fairly complete metacharacter reference. It describes regular expression engines in general, so a few entries (for example `\cx`, `\p{P}`, and `\<` `\>`) are not supported by Python's re module.

| Metacharacter | Description |
| --- | --- |
| `\` | Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, `n` matches the character "n", while `\n` matches a newline character. The sequence `\\` matches "\" and `\(` matches "(". This corresponds to the "escape character" concept in many programming languages. |
| `^` | Matches the beginning of the input. If the Multiline property of the RegExp object is set, `^` also matches the position after "\n" or "\r". Python's counterpart is the re.M flag, which applies to "\n" only. |
| `$` | Matches the end of the input. If the Multiline property of the RegExp object is set, `$` also matches the position before "\n" or "\r". Python's counterpart is the re.M flag, which applies to "\n" only. |
| `*` | Matches the preceding subexpression zero or more times. For example, `zo*` matches "z" as well as "zo" and "zoo". Equivalent to `{0,}`. |
| `+` | Matches the preceding subexpression one or more times. For example, `zo+` matches "zo" and "zoo", but not "z". Equivalent to `{1,}`. |
| `?` | Matches the preceding subexpression zero or one time. For example, `do(es)?` matches "do" or "does". Equivalent to `{0,1}`. |
| `{n}` | n is a non-negative integer. Matches exactly n times. For example, `o{2}` does not match the "o" in "Bob", but it matches the two o's in "food". |
| `{n,}` | n is a non-negative integer. Matches at least n times. For example, `o{2,}` does not match the "o" in "Bob", but it matches all the o's in "foooood". `o{1,}` is equivalent to `o+`, and `o{0,}` is equivalent to `o*`. |
| `{n,m}` | m and n are non-negative integers, where n<=m. Matches at least n and at most m times. For example, `o{1,3}` matches the first three o's in "fooooood" as one match and the last three o's as another. `o{0,1}` is equivalent to `o?`. Note that there can be no spaces between the comma and the two numbers. |
| `?` (after a quantifier) | When this character immediately follows any other quantifier (`*`, `+`, `?`, `{n}`, `{n,}`, `{n,m}`), matching becomes non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, on the string "oooo", `o+` matches as many o's as possible and returns "oooo", while `o+?` matches as few as possible and returns four separate matches: 'o', 'o', 'o', 'o'. |
| `.` | Matches any single character except line breaks ("\n" in Python's re; some engines also exclude "\r"). To match any character including line breaks, use a pattern like `[\s\S]`, or pass the re.S flag in Python. |
| `(pattern)` | Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection: the SubMatches collection in VBScript, the $0...$9 properties in JScript, and the match object's group() method in Python. To match parenthesis characters, use `\(` or `\)`. |
| `(?:pattern)` | Non-capturing group: matches pattern but does not capture the result or store it for later use. This is useful when combining parts of a pattern with the or character. For example, `industr(?:y\|ies)` is a shorter expression than `industry\|industries`. |
| `(?=pattern)` | Non-capturing positive lookahead: matches at a position where the text that follows matches pattern, without consuming that text. For example, `Windows(?=95)` matches "Windows" in "Windows95" but not "Windows" in "Windows3.1". |
| `(?!pattern)` | Non-capturing negative lookahead: matches at a position where the text that follows does not match pattern. For example, `Windows(?!95)` matches "Windows" in "Windows3.1" but not "Windows" in "Windows95". |
| `(?<=pattern)` | Non-capturing positive lookbehind: like positive lookahead, but it checks the text before the position. For example, `(?<=95)Windows` matches "Windows" in "95Windows" but not "Windows" in "3.1Windows". |
| `(?<!pattern)` | Non-capturing negative lookbehind: like negative lookahead, but it checks the text before the position. For example, `(?<!95)Windows` matches "Windows" in "3.1Windows" but not "Windows" in "95Windows". |
| `x\|y` | Matches x or y. For example, `z\|food` matches "z" or "food", and `(z\|f)ood` matches "zood" or "food". |
| `[xyz]` | Character set. Matches any one of the enclosed characters. For example, `[abc]` matches the "a" in "plain". |
| `[^xyz]` | Negated character set. Matches any one character that is not enclosed. For example, `[^abc]` matches each of the characters "p", "l", "i", "n" in "plain". |
| `[a-z]` | Character range. Matches any character in the specified range. For example, `[a-z]` matches any lowercase letter from "a" to "z". Note: the hyphen denotes a range only when it is inside a character set and sits between two characters; at the beginning of the set it is just a literal hyphen. |
| `[^a-z]` | Negated character range. Matches any character that is not in the specified range. For example, `[^a-z]` matches any character that is not in the range "a" to "z". |
| `\b` | Matches a word boundary, that is, the position between a word and a space. (A regular expression can match either characters or positions; `\b` matches a position.) For example, `er\b` matches "er" in "never" but not "er" in "verb"; `\b1_` matches "1_" in "1_23" but not "1_" in "21_3". |
| `\B` | Matches a non-word boundary. `er\B` matches the "er" in "verb" but not the "er" in "never". |
| `\cx` | Matches the control character specified by x. For example, `\cM` matches Control-M, the carriage return character. The value of x must be in A-Z or a-z; otherwise c is treated as a literal "c" character. |
| `\d` | Matches a digit character. Equivalent to `[0-9]`. grep needs the -P option (Perl regular expressions) to support it. |
| `\D` | Matches a non-digit character. Equivalent to `[^0-9]`. grep needs the -P option (Perl regular expressions) to support it. |
| `\f` | Matches a form feed character. Equivalent to `\x0c` and `\cL`. |
| `\n` | Matches a newline character. Equivalent to `\x0a` and `\cJ`. |
| `\r` | Matches a carriage return character. Equivalent to `\x0d` and `\cM`. |
| `\s` | Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to `[ \f\n\r\t\v]`. |
| `\S` | Matches any non-whitespace character. Equivalent to `[^ \f\n\r\t\v]`. |
| `\t` | Matches a tab character. Equivalent to `\x09` and `\cI`. |
| `\v` | Matches a vertical tab character. Equivalent to `\x0b` and `\cK`. |
| `\w` | Matches any word character, including the underscore. Similar but not equivalent to `[A-Za-z0-9_]`, because "word" characters here are drawn from the Unicode character set. |
| `\W` | Matches any non-word character. Roughly equivalent to `[^A-Za-z0-9_]`, with the same Unicode caveat as `\w`. |
| `\xn` | Matches n, where n is a hexadecimal escape value exactly two digits long. For example, `\x41` matches "A", and `\x041` is equivalent to `\x04` followed by "1". This lets you use ASCII codes in regular expressions. |
| `\num` | Matches num, where num is a positive integer: a backreference to a captured match. For example, `(.)\1` matches two consecutive identical characters. |
| `\n` (n is a digit) | Identifies an octal escape value or a backreference. If at least n captured subexpressions come before `\n`, then n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value. |
| `\nm` | Identifies an octal escape value or a backreference. If at least nm captured subexpressions come before `\nm`, then nm is a backreference. If at least n captures come before `\nm`, then n is a backreference followed by the literal m. If neither condition holds and both n and m are octal digits (0-7), `\nm` matches the octal escape value nm. |
| `\nml` | If n is an octal digit (0-7) and both m and l are octal digits (0-7), matches the octal escape value nml. |
| `\un` | Matches n, where n is a Unicode character written as four hexadecimal digits. For example, `\u00A9` matches the copyright symbol (©). |
| `\p{P}` | The lowercase p stands for property, meaning a Unicode property, and is the prefix for Unicode regular expressions. The "P" in the braces is one of the seven character properties of the Unicode character set: punctuation. The other six are L: letters; M: marks (generally not appearing alone); Z: separators (such as spaces and newlines); S: symbols (such as mathematical and currency symbols); N: numbers (such as Arabic and Roman numerals); C: other characters. Note: some languages do not support this syntax, for example JavaScript. |
| `\<` `\>` | Match the beginning (`\<`) and end (`\>`) of a word. For example, the regular expression `\<the\>` matches "the" in the string "for the wise" but not "the" in the string "otherwise". Note: not all software supports these metacharacters. |
| `( )` | Defines the expression between ( and ) as a group and saves the characters that match it to a temporary area (a regular expression can save up to 9). They can be referenced with `\1` to `\9`. |
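Several entries in the table above are easier to follow when run. The sketch below tries greedy and non-greedy quantifiers, lookahead and lookbehind, word boundaries, a backreference, and a non-capturing group in Python's re module. The variable names are chosen only for this illustration.

```python
import re

# Greedy vs. non-greedy quantifiers on the string "oooo"
greedy = re.findall(r"o+", "oooo")
lazy = re.findall(r"o+?", "oooo")

# Lookahead: "Windows" only when followed (or not followed) by "95"
followed = re.search(r"Windows(?=95)", "Windows95")
not_followed = re.search(r"Windows(?=95)", "Windows3.1")
negative = re.search(r"Windows(?!95)", "Windows3.1")

# Lookbehind: "Windows" only when preceded by "95"
preceded = re.search(r"(?<=95)Windows", "95Windows")

# Word boundary: "er" at the end of a word only
boundary = re.findall(r"er\b", "never verb")

# Backreference: two consecutive identical characters
doubled = re.search(r"(.)\1", "hello")

# Non-capturing group: findall returns whole matches, not group contents
plural = re.findall(r"industr(?:y|ies)", "industry industries")

print(greedy)            # ['oooo']
print(lazy)              # ['o', 'o', 'o', 'o']
print(followed.group())  # Windows
print(not_followed)      # None
print(boundary)          # ['er']
print(doubled.group())   # ll
print(plural)            # ['industry', 'industries']
```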

Detailed re module

Python provides the re module for working with regular expressions. Here are a few commonly used functions.

re.match

re.match tries to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.

On success, it returns a match object.

Syntax

    import re
    re.match(pattern, string, flags=0)

Parameter description

| Parameter | Description |
| --- | --- |
| pattern | The regular expression to match |
| string | The string to search |
| flags | Flags that control how the pattern is matched, such as case-insensitive or multi-line matching |

demo

    # Basic match
    content = "Hello 1234567 World_This is a Regex Demo"
    print(len(content))
    # re.match must match from the start of the string
    result = re.match(r"^Hello\s\d+\s\w{10}.*?Demo$", content)
    # result = re.match(r"^Hello\s\d{7}\s\w{10}.*?Demo$", content)
    print(result)
    print(result.group())
    print(result.span())

If the string contains newline characters, use the re.S flag.

    # If the string contains newlines, pass the re.S flag
    content = """Hello 1234567 World_This is a Regex Demo.
    My name is Peter
    I am from shenzhen
    """
    print(len(content))
    result = re.match(r"^Hello\s\d+\s.*?shenzhen$", content, re.S)
    # result = re.match(r"^Hello\s\d{7}\s\w{10}.*?Peter$", content)
    print(result)
    print(result.group())
    print(result.span())

    line = "Cats are smarter than dogs"

    matchObj = re.match(r'(.*) are (.*?) .*?', line, re.M | re.I)
    if matchObj:
        print("matchObj.group() : ", matchObj.group())    # the entire match
        print("matchObj.group(1) : ", matchObj.group(1))  # content of the first ()
        print("matchObj.group(2) : ", matchObj.group(2))  # content of the second ()
    else:
        print("No match!!")

Use re.match sparingly: it only succeeds when the match begins at the start of the string.



re.search

re.search scans the entire string and returns the first successful match; otherwise it returns None. Unlike re.match, it does not require the match to begin at the start of the string. It stops searching as soon as it finds the first match.

You can use the match object's group(num) or groups() methods to get the matched content.

Function syntax

    re.search(pattern, string, flags=0)

Parameter description

| Parameter | Description |
| --- | --- |
| pattern | The regular expression to match |
| string | The string to search |
| flags | Flags that control how the pattern is matched, such as case-insensitive or multi-line matching |
demo

  1. Only the first successful match is returned
  2. The argument passed to group() cannot exceed the number of capture groups (pairs of parentheses) in the pattern
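The two points above can be checked with a small sketch. The sample string is an assumption for illustration: it puts extra text in front of "Hello" so that re.match fails while re.search succeeds.

```python
import re

content = "Extra text Hello 1234567 World_This is a Regex Demo"
pattern = r"Hello\s(\d+)\s(\w+)"

# re.match fails because the string does not start with "Hello"
match_result = re.match(pattern, content)

# re.search finds the first match anywhere in the string
search_result = re.search(pattern, content)
print(search_result.group())   # Hello 1234567 World_This
print(search_result.group(1))  # 1234567
print(search_result.groups())  # ('1234567', 'World_This')
print(search_result.span())    # (11, 35)

# The pattern has only two groups, so group(3) raises IndexError
try:
    search_result.group(3)
    group_error = False
except IndexError:
    group_error = True
print(group_error)             # True
```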

re.findall

re.findall scans the entire string and returns all matches as a list.

Syntax

    re.findall(pattern, string, flags=0)

Parameter description

| Parameter | Description |
| --- | --- |
| pattern | The regular expression to match |
| string | The string to search |
| flags | Flags that control how the pattern is matched, such as case-insensitive or multi-line matching |
demo

The result is a list.

If the pattern contains more than one capture group, for example several (.*?) groups, the result is still a list, but each element is a tuple holding one string per group.
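A minimal sketch of how the shape of the findall result depends on the number of capture groups. The sample text is invented for the example.

```python
import re

text = "apple=3, banana=12, cherry=7"

# No capture group: each element is the whole match
no_group = re.findall(r"\w+=\d+", text)

# One capture group: each element is the content of that group
one_group = re.findall(r"\w+=(\d+)", text)

# Two capture groups: each element is a tuple with one item per group
two_groups = re.findall(r"(\w+)=(\d+)", text)

print(no_group)    # ['apple=3', 'banana=12', 'cherry=7']
print(one_group)   # ['3', '12', '7']
print(two_groups)  # [('apple', '3'), ('banana', '12'), ('cherry', '7')]
```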

re.sub

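re.sub uses a regular expression to replace content in a string.

Syntax

    re.sub(pattern, repl, string, count=0, flags=0)

Parameter description

The meanings of the parameters are:

| Parameter | Description |
| --- | --- |
| pattern | The regular expression to match |
| repl | The replacement string, or a function that returns it |
| string | The original string to process |
| count | The maximum number of replacements; the default 0 replaces every match |
| flags | Flags that control how the pattern is matched |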

demo
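A minimal sketch of re.sub with a string replacement. The phone-number string is a made-up sample, not taken from the original article.

```python
import re

phone = "2004-959-559 # a sample phone number"

# Remove the trailing comment
no_comment = re.sub(r"\s*#.*$", "", phone)

# Remove everything that is not a digit
digits_only = re.sub(r"\D", "", no_comment)

# count=1 replaces only the first match
first_only = re.sub(r"-", "", no_comment, count=1)

print(no_comment)   # 2004-959-559
print(digits_only)  # 2004959559
print(first_only)   # 2004959-559
```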

Special processing with sub

re.sub also accepts a function as repl. The function is called with each match object, and its return value is used as the replacement.
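As a sketch of this, the hypothetical helper double below receives each match object and returns the matched number multiplied by two. The helper name and the sample string are assumptions for illustration.

```python
import re


def double(match):
    """Return the matched number multiplied by two, as a string."""
    return str(int(match.group()) * 2)


text = "A23G4HFD567"
doubled_text = re.sub(r"\d+", double, text)
print(doubled_text)  # A46G8HFD1134
```

The function must return a string, because its result is spliced back into the text.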

Two modes

The two modes are greedy mode and non-greedy mode.

3 Symbols

We often use 3 symbols in regular expressions:

demo

Explanation

  1. In the non-greedy example, the question mark ? selects non-greedy mode. Matching stops as soon as the requirement is met, so aaaacb is found first; matching then resumes and finds ab, and then adceb
  2. In the greedy example, the program finds the longest string that meets the requirement
  3. In the last example, .? is used, which means there can be only 0 or 1 characters between a and b, so the result contains only two matches
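The behavior described in the explanation can be reproduced with the sketch below. The input string "aaaacbabadceb" is an assumption: it was chosen to be consistent with the matches named in the explanation and is not necessarily the author's original input.

```python
import re

# Hypothetical sample string, chosen to fit the matches described above
text = "aaaacbabadceb"

# Non-greedy: stop at the first "b" each time
non_greedy = re.findall(r"a.*?b", text)

# Greedy: run to the last "b" in the string
greedy = re.findall(r"a.*b", text)

# .? allows at most one character between "a" and "b"
optional = re.findall(r"a.?b", text)

print(non_greedy)  # ['aaaacb', 'ab', 'adceb']
print(greedy)      # ['aaaacbabadceb']
print(optional)    # ['acb', 'ab']
```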

Regex modifiers: optional flags

Regular expressions can take optional flag modifiers that control how matching works. Multiple flags can be combined with bitwise OR (|). For example, re.I | re.M sets both the I and M flags:

| Modifier | Description |
| --- | --- |
| re.I | Makes matching case-insensitive |
| re.L | Does locale-aware matching |
| re.M | Multi-line matching; affects `^` and `$` |
| re.S | Makes `.` match every character, including newline |
| re.U | Interprets characters according to the Unicode character set. This flag affects `\w`, `\W`, `\b`, `\B`. |
| re.X | Verbose mode: allows whitespace and comments inside the pattern so that it is easier to read |
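A short sketch of the most common flags. The sample text and the date pattern are invented for illustration.

```python
import re

text = "Hello\nworld"

# re.I: ignore case
case_insensitive = re.findall(r"hello", text, re.I)

# re.M: ^ matches at the start of every line, not only the string
single_line = re.findall(r"^\w+", text)
multi_line = re.findall(r"^\w+", text, re.M)

# re.S: . also matches the newline
dot_default = re.search(r"Hello.world", text)
dot_all = re.search(r"Hello.world", text, re.S)

# Flags combined with bitwise OR
combined = re.findall(r"^[a-z]+", text, re.I | re.M)

# re.X: whitespace and comments inside the pattern are ignored
date_pattern = re.compile(r"""
    \d{4}   # year
    -       # separator
    \d{2}   # month
""", re.X)
date = date_pattern.search("released 2024-01").group()

print(case_insensitive)  # ['Hello']
print(single_line)       # ['Hello']
print(multi_line)        # ['Hello', 'world']
print(dot_default)       # None
print(combined)          # ['Hello', 'world']
print(date)              # 2024-01
```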

Regular expression examples

Character matching

| Example | Description |
| --- | --- |
| `python` | Matches "python" |

Character class

| Example | Description |
| --- | --- |
| `[Pp]ython` | Matches "Python" or "python": one letter is chosen from Pp |
| `rub[ye]` | Matches "ruby" or "rube": one letter is chosen from ye |
| `[aeiou]` | Matches any one letter in the brackets |
| `[0-9]` | Matches any digit; equivalent to `[0123456789]` |
| `[a-z]` | Matches any lowercase letter |
| `[A-Z]` | Matches any uppercase letter |
| `[a-zA-Z0-9]` | Matches any letter or digit |
| `[^aeiou]` | Matches any character other than the letters aeiou; `^` inverts the set |
| `[^0-9]` | Matches any character other than a digit |
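The character classes above can be tried directly with re.findall. The sample strings are made up for the example.

```python
import re

either_case = re.findall(r"[Pp]ython", "Python and python")
ruby_or_rube = re.findall(r"rub[ye]", "ruby rube rubx")
vowels = re.findall(r"[aeiou]", "regex")
digits = re.findall(r"[0-9]", "a1b22")
non_vowels = re.findall(r"[^aeiou]", "regex")
non_digits = re.findall(r"[^0-9]", "a1b22")

print(either_case)   # ['Python', 'python']
print(ruby_or_rube)  # ['ruby', 'rube']
print(vowels)        # ['e', 'e']
print(digits)        # ['1', '2', '2']
print(non_vowels)    # ['r', 'g', 'x']
print(non_digits)    # ['a', 'b']
```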

Special character class

| Example | Description |
| --- | --- |
| `.` | Matches any single character except "\n". To match any character including "\n", use a pattern like `[\s\S]` or pass the re.S flag. |
| `\d` | Matches a digit character. Equivalent to `[0-9]`. |
| `\D` | Matches a non-digit character. Equivalent to `[^0-9]`. |
| `\s` | Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to `[ \f\n\r\t\v]`. |
| `\S` | Matches any non-whitespace character. Equivalent to `[^ \f\n\r\t\v]`. |
| `\w` | Matches any word character, including the underscore. Equivalent to `[A-Za-z0-9_]` for ASCII text. |
| `\W` | Matches any non-word character. Equivalent to `[^A-Za-z0-9_]` for ASCII text. |

Summary

