— Wang Yuyang — notes based on the MOOC course summary (Code_boy)
Requests library: automatically fetches HTML pages and submits network requests
robots.txt: the Robots Exclusion Standard for web crawlers
Beautiful Soup library: parses HTML pages (install bs4 in PyCharm)
re regular expressions: a practical choice for extracting data from HTML
requests.get(url, params=None, **kwargs)
url: the URL of the page to fetch
params: additional parameters appended to the url, dictionary or byte-stream format, optional
**kwargs: 12 optional keyword arguments that control access
Properties of the Response object {0.0.py}
Properties | Description |
---|---|
r.status_code | Return status of HTTP request, 200 means connection is successful, 404 means failure |
r.text | The string form of the HTTP response content, (ie: the page content corresponding to the url) |
r.encoding | The encoding method of the response content guessed from the HTTP header |
r.apparent_encoding | The response content encoding method analyzed from the content (alternative encoding method) |
r.content | Binary form of HTTP response content |
Examples:
```python
import requests

r = requests.get("http://www.baidu.com")
print(r.status_code)         # return status code (200 means the request succeeded)
# Result: 200
print(r.text)                # string form of the response content
# Result: the source code of the Baidu homepage (with garbled characters)
print(r.encoding)            # encoding guessed from the HTTP headers
# Result: ISO-8859-1 (the encoding declared for the Baidu page)
print(r.apparent_encoding)   # encoding analyzed from the content (alternative encoding)
# Result: utf-8
print(r.content)             # binary form of the response content
# Result: raw bytes

r.encoding = 'utf-8'         # 'utf-8' is the value of r.apparent_encoding
print(r.text)
# Result: human-readable content. r.text is decoded using r.encoding,
# so redefining r.encoding changes how the text is displayed.
```
General crawler code framework:
```python
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()               # raise requests.HTTPError if the status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "Produced an exception"

if __name__ == "__main__":
    url = 'http://www.baidu.com'
    print(getHTMLText(url))
```
Requests library exception
Exception | description |
---|---|
requests.ConnectionError | Network connection error exception, (such as DNS query failure, connection refused, etc.) |
requests.HTTPError | HTTP Error Exception |
requests.URLRequired | URL missing exception |
requests.TooManyRedirects | Exceeded the maximum number of redirects, a redirect exception occurred |
requests.ConnectTimeout | Connection to remote server timed out exception |
requests.Timeout | Request URL timed out; a timeout exception occurred |
r.raise_for_status() | If the status code is not 200, raises requests.HTTPError |
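As a supplement to the table, here is a minimal sketch (my own, not from the course) that catches these exceptions individually instead of using a bare except; the URL is only a placeholder:

```python
import requests

def fetch(url):
    # Variant of the general framework that reports which exception occurred.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raises requests.HTTPError if the status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except requests.ConnectionError:
        return "Network connection error (DNS failure, connection refused, ...)"
    except requests.Timeout:
        return "Request timed out"
    except requests.HTTPError as e:
        return "HTTP error: {}".format(e)
    except requests.TooManyRedirects:
        return "Exceeded the maximum number of redirects"

print(fetch('http://www.baidu.com'))
```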
The 7 main methods of the Requests library
Method | Description |
---|---|
requests.request() | Construct a request to support the basic method of the following methods |
requests.get() | The main method for getting HTML pages, corresponding to HTTP GET |
requests.head() | Method to get HTML page header information, corresponding to HTTP HEAD |
requests.post() | The method of submitting POST requests to HTML pages, corresponding to HTTP POST |
requests.put() | The method of submitting PUT requests to HTML pages, which corresponds to HTTP PUT |
requests.patch() | Submit a partial modification request to the HTML page, which corresponds to the PATCH of HTTP |
requests.delete() | Submit a delete request to the HTML page, corresponding to HTTP DELETE |
HTTP protocol:
HTTP: Hypertext Transfer Protocol
HTTP is a stateless application-layer protocol based on the "request and response" model.
HTTP uses URLs as identifiers to locate network resources.
URL format: http://host[:port][path]
host: legal Internet host domain name or IP address
port: port number (default port is 80)
path: the path of the requested resource
HTTP protocol operations on resources:
Method | Description |
---|---|
GET | Request to get the resource at URL location |
HEAD | Request to get the response information report of the URL location resource, that is, get the header information of the resource |
POST | Request to append new data to the resource at the URL location |
PUT | Request to store a resource at the URL location, overwriting the resource at the original URL location |
PATCH | Request a partial update of the resource at the URL location, that is, change part of the content of the resource |
DELETE | Request to delete resources stored in URL location |
Understand the difference between PATCH and PUT
Suppose there is a set of data UserInfo in the URL location, including 20 fields such as UserID and UserName
Requirement: the user modified UserName; the other fields stay unchanged
With PATCH, only the partial update for UserName is submitted to the URL [partial modification]
With PUT, all 20 fields must be submitted together; any field not submitted is deleted [full overwrite]
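A minimal sketch (not part of the original notes) that contrasts the two methods against httpbin; the UserID/UserName field names are only illustrative:

```python
import requests

# PATCH: submit only the field that changed; the other fields are kept
r = requests.patch('http://httpbin.org/patch', data={'UserName': 'new_name'})
print(r.status_code)

# PUT: the whole record must be submitted again; anything omitted is lost
full_record = {'UserID': '001', 'UserName': 'new_name'}   # ...plus the remaining fields
r = requests.put('http://httpbin.org/put', data=full_record)
print(r.status_code)
```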
```python
import requests

# requests.head(): request the response header of the resource at the URL
r = requests.head('http://httpbin.org/get')
print(r.headers)
'''
Result:
{'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0',
 'Date': 'Thu, 22 Nov 2018 03:52:22 GMT', 'Content-Type': 'application/json',
 'Content-Length': '268', 'Access-Control-Allow-Origin': '*',
 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'}
'''

# requests.post(): request that new data be appended to the resource at the URL
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=payload)
# POSTing a dictionary to the URL automatically encodes it as a form
print(r.text)
'''
Result:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "23",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "json": null,
  "origin": "106.111.147.213",
  "url": "http://httpbin.org/post"
}
'''

r = requests.post('http://httpbin.org/post', data='abc')
# POSTing a string to the URL automatically encodes it as data
print(r.text)
'''
Result:
{
  "args": {},
  "data": "abc",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "3",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "json": null,
  "origin": "106.111.147.213",
  "url": "http://httpbin.org/post"
}
'''

# requests.put(): request that a resource be stored at the URL, overwriting the original resource
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.put('http://httpbin.org/put', data=payload)
# PUTting a dictionary to the URL automatically encodes it as a form
# (the difference from POST is that PUT overwrites the original content)
print(r.text)
'''
Result: the same structure as the dictionary POST example above,
except that "url" is "http://httpbin.org/put"
'''
```
Functions and usage of the requests library methods:
requests.request(method, url, **kwargs)
method: the request method, corresponding to the 7 types such as GET/PUT/POST
r = requests.request('GET', url, **kwargs)
r = requests.request('HEAD', url, **kwargs)
r = requests.request('POST', url, **kwargs)
r = requests.request('PUT', url, **kwargs)
r = requests.request('PATCH', url, **kwargs)
r = requests.request('DELETE', url, **kwargs)
r = requests.request('OPTIONS', url, **kwargs)  # obtain parameters the server supports for interaction
url: the url link of the page to be fetched
**kwargs: parameters that control access, 13 in total (all optional)
params: dictionary or byte sequence, added to the url as parameters (a '?' is inserted between the url and the parameters)
data: dictionary, byte sequence or file object, used as the content of the Request
json: data in JSON format, used as the content of the Request
headers: dictionary, custom HTTP headers
cookies: dictionary or CookieJar, the cookies of the Request
auth: tuple, supports HTTP authentication
files: dictionary, used to transfer files
timeout: timeout in seconds (if no response arrives within this time, an exception is raised)
proxies: dictionary, sets an access proxy server; can include login authentication, e.g.
pxs = {'http': 'http://user:password@proxy.host:1234'}   # placeholder proxy address
r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)
allow_redirects: True/False, default True, redirect switch
stream: True/False, default False; when False the response content is downloaded immediately, when True it is fetched on demand (streamed)
verify: True/False, default True, switch for verifying the SSL certificate
cert: path to a local SSL certificate
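For illustration, a small sketch (mine, not from the course) that combines several of these keyword arguments in one requests.request() call; the header value is a placeholder:

```python
import requests

kv = {'wd': 'python'}                      # params: appended to the url after a '?'
hd = {'User-Agent': 'Mozilla/5.0'}         # headers: pretend to be a browser
r = requests.request('GET', 'http://www.baidu.com/s',
                     params=kv,
                     headers=hd,
                     timeout=30,           # seconds before a Timeout exception
                     allow_redirects=True) # follow redirects (the default)
print(r.request.url)     # the final url that was actually requested
print(r.status_code)
```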
requests.get(url, params=None, **kwargs)
url: URL link of the page
params: additional parameters appended to the url, dictionary or byte-stream format, optional
**kwargs: 12 parameters that control access

requests.head(url, **kwargs)
url: URL link of the page
**kwargs: 13 parameters that control access

requests.post(url, data=None, json=None, **kwargs)
url: URL link of the page
data: dictionary, byte sequence or file, the content of the Request
json: data in JSON format, the content of the Request
**kwargs: 11 parameters that control access (data and json are already named)

requests.put(url, data=None, **kwargs)
url: URL link of the page
data: dictionary, byte sequence or file, the content of the Request
**kwargs: 12 parameters that control access

requests.patch(url, data=None, **kwargs)
url: URL link of the page
data: dictionary, byte sequence or file, the content of the Request
**kwargs: 12 parameters that control access

requests.delete(url, **kwargs)
url: URL link of the page to delete
**kwargs: 13 parameters that control access
The scale of web crawlers:

Purpose | Crawling web pages, playing with web pages | Crawling websites or a series of websites | Crawling the entire web |
---|---|---|---|
Scale | Small scale, small amount of data, insensitive to crawling speed (>= 90% of crawlers) | Medium scale, large amount of data, sensitive to crawling speed | Large scale, search engines, crawling speed is critical |
Common libraries | requests library | scrapy library | custom development |
Harassment caused by web crawlers:
Crawlers can issue requests far faster than humans, and it is difficult for a web server to provide resources at the pace of a high-speed crawler
Depending on the author's purpose and skill level, a web crawler can impose a huge resource overhead on a web server
Legal risks of web crawlers:
The data on a server has property rights
Profiting from data obtained by a web crawler brings legal risk
Web crawling: "there is a right way to crawl"
Ways websites limit crawlers:
Source review: restrict access by checking User-Agent
·Check the User-Agent field of the HTTP request header, and respond only to visits from browsers or friendly crawlers
Announcement: the Robots protocol
Robots Exclusion Standard: the web crawler exclusion standard
Role: informs all crawlers of the website's crawling policy and requires crawlers to comply
Format: a robots.txt file in the root directory of the website
JD Robots Agreement: (https://www.jd.com/robots.txt)

```
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
```

Meaning: every crawler must comply; paths beginning with /? may not be accessed, nor may /pop/*.html or /pinpai/*.html?*; EtaoSpider, HuihuiSpider, GwdangSpider and WochachaSpider are identified as malicious crawlers and are forbidden from crawling any content.
Basic syntax of the Robots protocol:

Directive | Meaning |
---|---|
* | Represents all |
/ | Represents the root directory |
User-agent: * | Specifies which crawlers the rules apply to (* means all crawlers) |
Disallow: / | Directories the crawler is not allowed to access |
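A small sketch (not in the original notes) that simply downloads and prints a site's robots.txt using the general framework from above:

```python
import requests

def get_robots(site):
    # robots.txt always lives in the root directory of the website
    url = site.rstrip('/') + '/robots.txt'
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "Crawl failed"

print(get_robots('https://www.jd.com'))
```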
Basic crawler examples: (code + summary) {0.1.py}

```python
# Examples: JD product page, Amazon, Baidu search, image download, video download, IP lookup
import requests
import os

# Example: crawling a JD product page
'''
r = requests.get('https://item.jd.com/36826348085.html?jd_pop=14982c1c-64d9-4bab-ac5c-e40af7ce62a2&abt=0')
print(r.status_code)
print(r.encoding)
print(r.text[:1000])

# Example: crawling Amazon
kv = {'User-Agent': 'mozilla/5.0'}    # custom request header (pretend to be a browser)
r = requests.get('https://www.amazon.cn/dp/B073LJR2JF/ref=cngwdyfloorv2_recs_0/459-8865541-1481366?pf_rd_m=A1AJ19PSB66TGU&pf_rd_s=desktop-2&pf_rd_r=AT12APC0BDJ216Y55VSR&pf_rd_r=AT12APC0BDJ216Y55VSR&pf_rd_t=36701&pf_rd_p=d2aa3428-dc2b-4cfe-bca6-5e3a33f2342e&pf_rd_p=d2aa3428-dc2b-4cfe-bca6-5e3a33f2342e&pf_rd_i=desktop', headers=kv)
print(r.status_code)
print(r.text[:1000])
print(r.request.headers)
'''

# Example: Baidu search
# Baidu search url: http://www.baidu.com/s?wd={search keyword}
'''
keyword = 'python'
try:
    kv = {'wd': keyword}
    r = requests.get('http://www.baidu.com/s', params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print("Crawl failed")
'''

# Case: crawling and storing a web image
'''
url = "http://img0.dili360.com/pic/2018/09/21/5ba4a6af2af103q29157813_t.jpg"
rot = "E://"
path = rot + url.split('/')[-1]     # name the saved image after the last part of the url
try:
    if not os.path.exists(rot):     # check whether the rot directory exists; create it if not
        os.mkdir(rot)               # create the rot directory
    if not os.path.exists(path):    # check whether the file already exists in the directory
        r = requests.get(url)       # request the url
        # with open(path, 'wb') as f:   # another way to open and store the file
        f = open(path, 'wb')        # open the file
        f.write(r.content)          # r.content is the binary content of the response
        f.close()
        print("Saved successfully")
    else:
        print("File already exists")
except:
    print("Crawl failed")
'''

# Crawling a single video
'''
url = "http://video.pearvideo.com/mp4/adshort/20181121/cont-1480150-13269168_adpkg-ad_hd.mp4"
rot = "E://"
path = rot + url.split('/')[-1]     # name the saved video after the last part of the url
try:
    if not os.path.exists(rot):     # check whether the rot directory exists; create it if not
        os.mkdir(rot)               # create the rot directory
    if not os.path.exists(path):    # check whether the file already exists in the directory
        r = requests.get(url)       # request the url
        # with open(path, 'wb') as f:   # another way to open and store the file
        f = open(path, 'wb')        # open the file
        f.write(r.content)          # r.content is the binary content of the response
        f.close()
        print("Saved successfully")
    else:
        print("File already exists")
except:
    print("Crawl failed")
'''

# Automatic lookup of the region an IP address belongs to
# ip138 query link: http://ip138.com/ips138.asp?ip=123.125.6.1&action=2
# therefore the query url is: http://ip138.com/ips138.asp?ip=<ip address>
url = "http://ip138.com/ips138.asp"
ip = {"ip": "106.111.147.213"}
try:
    r = requests.get(url, params=ip)
    r.raise_for_status()
    print(r)
    # print(r.status_code)
    r.encoding = r.apparent_encoding
    print(r.text)
except:
    print("Crawl failed")
```
Small summary:
As the examples show, we need to understand each site's URL interface (how to find it and how to use it). The params argument adds new content after the url; when the request is made, a '?' is inserted at the junction between the url and the parameters.
We also saw slices such as r.text[:1000], which limit how many characters of the page source are printed.
The image and video examples save the crawled content in binary form via r.content; a with-open form of the same file handling is sketched below.
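The comments in the image and video examples mention with open(path, 'wb') as f as "another way to open and store files"; a brief sketch of that form (my own, reusing the image URL from the example), which closes the file automatically:

```python
import requests

url = "http://img0.dili360.com/pic/2018/09/21/5ba4a6af2af103q29157813_t.jpg"
path = "E://" + url.split('/')[-1]

r = requests.get(url)
with open(path, 'wb') as f:   # the file is closed automatically when the block ends
    f.write(r.content)        # r.content is the binary form of the response
print("Saved successfully")
```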
Beautiful Soup library: (install bs4 in pycharm) {0.2 bs.py}
Use of the library:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>data</p>', 'html.parser')
```
Understanding the Beautiful Soup library: a functional library for parsing, traversing and maintaining a "tag tree"
The commonly used import form: from bs4 import BeautifulSoup
Beautiful Soup class <=> tag tree <=> HTML document
Because of this equivalence, the tag tree becomes a variable through the BeautifulSoup class;
Simple analogy: a BeautifulSoup object corresponds to the entire content of an HTML/XML document
Beautiful Soup library Parser:
Parser | How to use | Requirements |
---|---|---|
HTML parser of bs4 | BeautifulSoup(mk, 'html.parser') | install the bs4 library |
HTML parser of lxml | BeautifulSoup(mk, 'lxml') | pip install lxml |
XML parser of lxml | BeautifulSoup(mk, 'xml') | pip install lxml |
html5lib parser | BeautifulSoup(mk, 'html5lib') | pip install html5lib |
Basic elements of Beautiful Soup library:
Basic elements | Description |
---|---|
Tag | A tag, the most basic information organization unit; <> and </> mark the beginning and end |
Name | The name of the tag; for <p>...</p> the name is 'p'; format: <tag>.name |
Attributes | Tag attributes, organized as a dictionary; format: <tag>.attrs |
NavigableString | The non-attribute string inside a tag (the string between <>...</>); format: <tag>.string |
Comment | A comment string inside a tag, a special Comment type |
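A short sketch (mine, not from the course) showing the basic element types, including Comment, which the course code below does not demonstrate:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b><!--This is a comment--></b><p>This is not a comment</p>',
                     'html.parser')
print(type(soup.b.string))   # <class 'bs4.element.Comment'> (comment text, markers removed)
print(soup.b.string)         # This is a comment
print(type(soup.p.string))   # <class 'bs4.element.NavigableString'>
print(type(soup.p))          # <class 'bs4.element.Tag'>
print(soup.p.name)           # p   (Name)
print(soup.p.attrs)          # {}  (Attributes, a dictionary)
```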
Beautiful Soup tag selection (summarized in the table below):

Expression | Description | Remarks |
---|---|---|
soup.<tag> | Get a tag from the parsed webpage | t = soup.<tag> |
soup.<tag>.name | Get the name of the tag | |
soup.<tag>.attrs | Get the attributes of the tag (a dictionary) | The rest can be inferred by analogy |
soup.<tag>.string | Get the non-attribute string inside the tag | |
soup.<tag>.parent.name | Get the name of the tag's parent tag | |
Code example (with comments):

```python
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
print(r.text)
demo = r.text                        # demo == the page source of the url
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.title)                    # find the title tag
print(soup.a.string)                 # print the string content of the a tag
print(soup.a.name)                   # print the name of the a tag
print(soup.a.parent.name)            # print the name of the parent (enclosing) tag of the a tag
print(soup.a.parent.parent.name)     # print the name of the parent of that parent tag
tag = soup.a                         # soup.a: the first a tag
print(soup.a.attrs)                  # print the tag's attributes
print(tag.attrs['class'])            # print the 'class' value among the tag's attributes
print(tag.attrs['href'])             # print the 'href' value among the tag's attributes
print(tag)                           # print the a tag itself
```
HTML traversal:
Downward traversal:
Properties | Description |
---|---|
. contents | A list of the child nodes; all children of <tag> are stored in a list |
. children | The iteration type of child nodes, similar to .contents, used to loop through the child nodes |
. descendants | Iteration type of descendants, including all descendants, used for loop traversal |
```python
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text                          # demo == the page source of the url
soup = BeautifulSoup(demo, 'html.parser')
print(soup.head)                       # get the head tag
print(soup.head.contents)              # the children of head, returned as a list
print(soup.body.contents)              # the children of the body tag (returned as a list)
print(len(soup.body.contents))         # use len to count the children of the body tag
print(soup.body.contents[1])           # index the returned list to view one of the children
# traverse the direct children
for child in soup.body.children:
    print(child)
# traverse all descendants
for child in soup.body.descendants:
    print(child)
```
Upward traversal:
Properties | Description |
---|---|
. parent | The parent tag of the node |
. parents | The iteration type of the node's ancestor label, used to loop through the ancestor node |
```python
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text                          # demo == the page source of the url
soup = BeautifulSoup(demo, 'html.parser')
print(soup.title.parent)               # the parent of the title tag >>> the head tag
# upward traversal
for parent in soup.a.parents:          # walk up through the ancestors of the a tag
    if parent is None:
        print(parent)                  # executed if the ancestor is None
    else:
        print(parent.name)             # print the name of the ancestor tag
```
Parallel traversal:
Properties | Description |
---|---|
. next_sibling | Returns the next parallel node label in the order of HTML text |
. previous_sibling | Return to the previous parallel node label in the order of HTML text |
. next_siblings | Iteration type, returns all subsequent parallel node labels in HTML text order |
. previous_siblings | Iteration type, returns all preceding parallel node tags in HTML text order |
```python
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text                          # demo == the page source of the url
soup = BeautifulSoup(demo, 'html.parser')
print(soup.a.next_sibling)                        # the next sibling of a is the string 'and'
print(soup.a.next_sibling.next_sibling)           # the sibling after that (the next a tag)
print(soup.a.previous_sibling)                    # the previous sibling of a
print(soup.a.previous_sibling.previous_sibling)   # >>> None (no such node)
# traverse the following parallel nodes:
for sibling in soup.a.next_siblings:
    print(sibling)
# traverse the preceding parallel nodes:
for sibling in soup.a.previous_siblings:
    print(sibling)
```
HTML format output based on bs4 library:
The prettify() method prints a tag in readable HTML format (each tag and string on its own line).
```python
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.a.prettify())      # output the a tag in HTML format
'''
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
'''
soup = BeautifulSoup('<p>Chinese</p>', 'html.parser')
print(soup.p.string)          # output the non-attribute string of the tag
'''
Chinese
'''
print(soup.p.prettify())      # output the tag in HTML format
'''
<p>
 Chinese
</p>
'''
```
Information organization and extraction: {0.3.py}
Information mark:
Marked-up information forms an organizational structure and adds an information dimension
Marked-up information can be used for communication, storage and display
The structure of the markup is as important as the information itself
Marked-up information is easier for programs to understand and use
HTML information mark:
HTML (Hyper Text Markup Language): the information organization method of the WWW (World Wide Web); it embeds sound, images and video into text by means of hypertext
HTML organizes different types of information with predefined tags <>...</>
Three forms of information mark: (XML\JSON\YAML)
XML (eXtensible Markup Language): an extensible markup language (based on HTML)
**JSON (JavaScript Object Notation): typed key-value pairs (key: value)**
YAML (YAML Ain't Markup Language): untyped key-value pairs (key: value); supports recursive definitions through indentation
Comparison of the three information markup methods (examples):

XML example:

    <person>
      <firstName>Tian</firstName>
      <lastName>Song</lastName>
      <address>
        <streetAddr>Zhongguancun South Street 5</streetAddr>
        <city>Beijing</city>
        <zipcode>100081</zipcode>
      </address>
      <prof>Computer System</prof><prof>Security</prof>
    </person>

It can be seen from the example that the proportion of effective information text is not high; most of the text is taken up by the tags.

JSON example:

    {
      "firstName": "Tian",
      "lastName": "Song",
      "address": {
        "streetAddr": "Zhongguancun South Street 5",
        "city": "Beijing",
        "zipcode": "100081"
      },
      "prof": ["Computer System", "Security"]
    }

JSON defines the related keys through typed key-value pairs.

YAML example (YAML is concise and clear):

    firstName: Tian
    lastName: Song
    address:
      streetAddr: Zhongguancun South Street 5
      city: Beijing
      zipcode: 100081
    prof:
      - Computer System
      - Security
Markup form | Comparison | Typical application |
---|---|---|
XML | The earliest general-purpose information markup language; extensible but verbose | Information interaction and transmission on the Internet |
JSON | Information has types; suitable for program processing (js); more concise than XML | Information interaction between mobile apps, the cloud and nodes; does not support comments |
YAML | Information has no types; the highest proportion of the text is information; very readable | Configuration files of various systems; comments make them easy to read |
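Because JSON maps directly onto Python types, the JSON example above can be handled with the standard json module; a minimal sketch (not from the notes), using the same fields as the example:

```python
import json

text = '''{
  "firstName": "Tian",
  "lastName": "Song",
  "address": {"streetAddr": "Zhongguancun South Street 5",
              "city": "Beijing", "zipcode": "100081"},
  "prof": ["Computer System", "Security"]
}'''

info = json.loads(text)             # JSON text -> Python dict
print(info['address']['city'])      # Beijing
print(json.dumps(info, indent=2))   # Python dict -> JSON text
```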
General methods of information extraction:
Method 1: completely parse the marked-up form of the information, then extract the key information.
XML, JSON, YAML: requires a markup parser, for example the tag-tree traversal of the bs4 library
Advantages: accurate parsing of the information
Disadvantages: the extraction process is cumbersome and slow
Method 2: ignore the markup form and search directly for the key information.
Use text search functions on the information
Advantages: the extraction process is simple and fast
Disadvantages: the accuracy of the results depends on the content; precision can be lacking
Fusion method: combine form parsing and searching to extract the key information
XML, JSON, YAML + search >>> requires both a markup parser and a text search function
Examples:
```python
import requests
from bs4 import BeautifulSoup

# Extract all URL links in an HTML page:
# 1. search for all <a> tags (the links live in the a tags)
# 2. parse the <a> tag format and extract the link content after href

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
for link in soup.find_all('a'):   # find the <a> tags in demo
    print(link.get('href'))       # read the "href" attribute of each <a> tag

'''
Result:
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
'''
```
Beautiful Soup library search methods:
bs4.element.Tag: the tag type

Method | <>.find_all(name, attrs, recursive, string, **kwargs) |
---|---|
Description | Searches inside a soup variable and returns a list of the matching results |

Parameters | Description |
---|---|
name | Search string for the tag name; a list can be passed to search for several tags at once |
attrs | Search string for tag attribute values; attributes can also be given as keyword indexes |
recursive | Whether to search all descendants (default True); False searches only the direct children |
string | Search string for the string areas <>...</> of the tags, e.g. string='....' |
Shorthand form: <tag>(..) is equivalent to <tag>.find_all(..), and soup(..) is equivalent to soup.find_all(..)

Extension methods (same parameters as find_all()):
Method | Description |
---|---|
<>. find() | Search and return only the first result (a Tag) |
<>. find_parents() | Search among the ancestor nodes; returns a list |
<>. find_parent() | Return the nearest ancestor node (a Tag) |
<>. find_next_siblings() | Search among the following parallel nodes; returns a list |
<>. find_next_sibling() | Return the next following parallel node |
<>. find_previous_siblings() | Search among the preceding parallel nodes; returns a list |
<>. find_previous_sibling() | Return the nearest preceding parallel node |
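A short sketch (mine, not from the course) of a few of these extension methods on the same demo page used throughout this section:

```python
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
soup = BeautifulSoup(r.text, 'html.parser')

print(soup.find('a'))                   # only the first <a> tag, not a list
print(soup.a.find_parent('p'))          # the nearest <p> ancestor of the first <a>
print(soup.a.find_next_sibling('a'))    # the next parallel <a> tag
print([t.name for t in soup.a.find_parents()])   # names of all ancestors, ending with '[document]'
```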
name instance:
```python
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.find_all('a'))          # return the <a> tags in soup as a list
print(soup.find_all(['a', 'b']))   # search for several tag names at once
for tag in soup.find_all(True):    # True matches every tag in soup
    print(tag.name)                # print the name of every tag
'''
Result:
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>,
 <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
[<b>The demo python introduces several python courses.</b>,
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>,
 <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
html
head
title
body
p
b
p
a
a
'''
```
Examples of attrs:
```python
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.find_all('p', 'course'))   # return the <p> tags whose class attribute is 'course'
print(soup.find_all(id='link1'))      # return the tags whose id attribute == 'link1'
'''
Result:
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
'''
```
Recursive instance:
```python
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.find_all('a'))                    # search all descendants for <a> tags
print(soup.find_all('a', recursive=False))   # returns an empty list: with recursive=False only the
                                             # direct children of soup are searched, and no <a> is there
```
string instance:
```python
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup)                                   # the complete HTML parsed by the parser
print(soup.find_all(string='Basic Python'))   # return the matching strings as a list
```
A directional crawler case: ranking of Chinese universities {0.4 bs.py}
```python
import requests
import bs4
from bs4 import BeautifulSoup

def getHTML(url):                      # standard framework
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print("Link crawling failed")

def uitHTML(ulist, demo):
    soup = BeautifulSoup(demo, 'html.parser')
    for tr in soup.find('tbody').children:
        # .find() returns the searched tag; .children then iterates over its child (lower-layer) nodes
        if isinstance(tr, bs4.element.Tag):
            # check the type of tr and skip anything that is not a Tag
            # (to avoid iterating over string content); isinstance checks the variable type
            tds = tr('td')    # equivalent to tr.find_all('td') >>> find the td tags (returns a list)
            # print(tds)      # for testing: inspect the contents of tds
            # tds now holds the HTML of all td tags inside this tr (in list form)
            ulist.append([tds[0].string, tds[1].string, tds[3].string])
            # save only the data we need from tds (.string >>> the string area) into the ulist list

def uitUlist(ulist, num):
    print("{:^16}\t{:^16}\t{:^16}".format("Rank", "School Name", "Total score"))
    for i in range(num):
        u = ulist[i]
        print("{:^16}\t{:^16}\t{:^16}".format(u[0], u[1], u[2]))

def main():
    ulist = []
    url = "http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html"
    demo = getHTML(url)
    uitHTML(ulist, demo)
    uitUlist(ulist, 20)

main()
```
Regular expression: re library: (https://www.cnblogs.com/wangyuyang1016/p/10034868.html)
Commonly used regular expression operators

Operator | Description | Example |
---|---|---|
. | Any single character | |
[ ] | Character set: a range of values for a single character | [abc] means a, b or c; [a-z] means any single character from a to z |
[^ ] | Non-character set: an exclusion range for a single character | [^abc] means any single character that is not a, b or c |
* | Zero or more repetitions of the previous character | abc* means ab, abc, abcc, abccc, etc. |
+ | One or more repetitions of the previous character | abc+ means abc, abcc, abccc, etc. |
? | Zero or one occurrence of the previous character | abc? means ab, abc |
\| | Either the left or the right expression | abc\|def means abc or def |
{m} | Exactly m repetitions of the previous character | ab{2}c means abbc |
{m,n} | m to n repetitions (inclusive) of the previous character | ab{1,2}c means abc, abbc |
^ | Matches the beginning of the string | ^abc means abc at the beginning of a string |
$ | Matches the end of the string | abc$ means abc at the end of a string |
( ) | Grouping marker; only the \| operator may be used inside | (abc) means abc; (abc\|def) means abc or def |
\d | A digit, equivalent to [0-9] | |
\w | A word character, equivalent to [A-Za-z0-9_] | |
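A few of the operators above in action; a small sketch (mine, not from the course notes):

```python
import re

print(re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084'))  # ['100081', '100084']
print(re.match(r'^[A-Za-z]+$', 'python') is not None)    # True (letters only, anchored at both ends)
print(re.findall(r'PY(B|N)', 'PYB PYN PYX'))             # ['B', 'N'] (the group captures B or N)
print(re.findall(r'ab{1,2}c', 'abc abbc abbbc'))         # ['abc', 'abbc']
```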
Main functions of the re library
Function | Description |
---|---|
re.search() | Search a string for the first position that matches the regular expression and return a match object |
re.match() | Match the regular expression from the beginning of a string and return the match object |
re.findall() | Search string, return all matching substrings in list type |
re.split() | Split a string according to the regular expression matching result and return the list type |
re.finditer() | Search for a string and return an iterable type of matching result, each iterable element is a match object |
re.sub() | Replace all substrings matching regular expressions in a string, and return the replaced string |
Detailed descriptions of the functions:

Syntax | re.search(pattern, string, flags=0) |
---|---|
pattern | String or raw string form of the regular expression |
string | The string to be matched |
flags | Control flags used with the regular expression (see the flags table below) |

Syntax | re.match(pattern, string, flags=0) |
---|---|
pattern | String or raw string form of the regular expression |
string | The string to be matched |
flags | Control flags used with the regular expression |

Syntax | re.findall(pattern, string, flags=0) |
---|---|
pattern | String or raw string form of the regular expression |
string | The string to be matched |
flags | Control flags used with the regular expression |

Syntax | re.split(pattern, string, maxsplit=0, flags=0) |
---|---|
pattern | String or raw string form of the regular expression |
string | The string to be matched |
maxsplit | Maximum number of splits; the remainder is returned as the last element |
flags | Control flags used with the regular expression |

Syntax | re.finditer(pattern, string, flags=0) |
---|---|
pattern | String or raw string form of the regular expression |
string | The string to be matched |
flags | Control flags used with the regular expression |

Syntax | re.sub(pattern, repl, string, count=0, flags=0) |
---|---|
pattern | String or raw string form of the regular expression |
repl | Replacement string for the matched substrings |
string | The string to be matched |
count | Maximum number of replacements |
flags | Control flags used with the regular expression |

The flags control flags (shared by all of the functions above):

Flag | Description |
---|---|
re.I | Ignore case when matching |
re.M | Multi-line matching: the ^ operator can match at the beginning of every line of the string |
re.S | The . operator matches all characters, including newlines |
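A compact sketch (mine, not from the course) exercising the six main functions on the same postal-code pattern used later in the notes:

```python
import re

text = 'BIT100081, TSU100084'

print(re.search(r'[1-9]\d{5}', text).group(0))   # '100081' (first match anywhere in the string)
print(re.match(r'[1-9]\d{5}', text))             # None (the string does not start with a match)
print(re.findall(r'[1-9]\d{5}', text))           # ['100081', '100084']
print(re.split(r'[1-9]\d{5}', text))             # ['BIT', ', TSU', '']
print(re.sub(r'[1-9]\d{5}', ':zipcode', text))   # 'BIT:zipcode, TSU:zipcode'
for m in re.finditer(r'[1-9]\d{5}', text):       # each m is a match object
    print(m.group(0))
```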
re.compile(): object-oriented usage:

Syntax | regex = re.compile(pattern, flags=0) |
---|---|
Description | Compiles the string form of a regular expression into a regular expression object; the compiled object provides the same functions as above, e.g. match = regex.search('string') |
pattern | String or raw string form of the regular expression |
flags | Control flags used with the regular expression |
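A minimal sketch (mine) of the compiled form; after re.compile the object's methods take the same arguments as the module-level functions, minus the pattern:

```python
import re

regex = re.compile(r'[1-9]\d{5}')             # compile once, reuse many times
match = regex.search('BIT 100081')            # same as re.search without the pattern argument
print(match.group(0))                         # '100081'
print(regex.findall('BIT100081 TSU100084'))   # ['100081', '100084']
```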
Properties of the Match object
Properties | Description |
---|---|
. string | Text to be matched |
. re | pattern object (regular expression) used in matching |
. pos | The starting position of the regular expression search text |
. endpos | The end position of the regular expression search text |
Match object method
Method | Description |
---|---|
. group(0) | Get the matched string |
. start() | The start position of the matched string in the original string |
. end() | The end position of the matched string in the original string |
. span() | return (.start(), .end()) |
Examples:
```python
import re

match = re.search(r'[1-9]\d{5}', 'BIT 100081')
# Properties of the match object
print(match.string)    # the string that was searched: 'BIT 100081'
print(match.re)        # the compiled pattern used for the match
print(match.pos)       # start position of the search in the text
print(match.endpos)    # end position of the search in the text
# Methods of the match object
print(match.group(0))  # the matched substring (first match result): '100081'
print(match.start())   # start index of the match in the original string
print(match.end())     # end index of the match in the original string
print(match.span())    # (.start(), .end()) returned as a tuple
```
Greedy matching and minimum matching of the re library
```python
import re

match = re.search(r'PY.*N', 'PYANBNCNDN')    # greedy matching (the re library is greedy by default:
print(match.group(0))                        # the longest matching substring is returned) -> 'PYANBNCNDN'
match = re.search(r'PY.*?N', 'PYANBNCNDN')   # minimal (non-greedy) matching
print(match.group(0))                        # -> 'PYAN'
```
Minimum matching operator
Operator | Description |
---|---|
*? | The previous character is expanded 0 times or unlimited times, the smallest match |
+? | The previous character is expanded 1 time or unlimited times, the smallest match |
?? | 0 or 1 expansion of the previous character, minimum match |
{m,n}? | m to n repetitions (inclusive) of the previous character, minimal match |