01 Crawl web pages quickly
import urllib.request

# Open the page and read the raw response into memory
file = urllib.request.urlopen("http://www.baidu.com")
data = file.read()
# Save the page to a local file in binary mode
fhandle = open("./1.html", "wb")
fhandle.write(data)
fhandle.close()
There are three common ways to read the content of the response:
file.read() reads the entire content and assigns it to one variable as a single block
file.readlines() reads the entire content and assigns it to a variable as a list of lines
file.readline() reads just one line of the content
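For example, a minimal sketch of the three methods (the connection is reopened each time, because a response object can only be read once):
import urllib.request

# read() returns the entire content as a single block
data = urllib.request.urlopen("http://www.baidu.com").read()
# readlines() returns the entire content as a list of lines
lines = urllib.request.urlopen("http://www.baidu.com").readlines()
# readline() returns only the next single line
first_line = urllib.request.urlopen("http://www.baidu.com").readline()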
The urlretrieve() function writes the retrieved content directly into a local file.
import urllib.request

# Download the page straight to a local file
filename = urllib.request.urlretrieve("http://edu.51cto.com", filename="./1.html")
# urlretrieve() leaves some cached data behind during execution; urlcleanup() clears it
urllib.request.urlcleanup()
import urllib.request

file = urllib.request.urlopen("http://www.baidu.com")
# Print the header information of the response
print(file.info())
# Bdpagetype:1
# Bdqid:0xb36679e8000736c1
# Cache-Control:private
# Content-Type: text/html;charset=utf-8
# Date: Sun, 24 May 2020 10:53:30 GMT
# Expires: Sun, 24 May 2020 10:52:53 GMT
# P3p: CP=" OTI DSP COR IVA OUR IND COM "
# P3p: CP=" OTI DSP COR IVA OUR IND COM "
# Server: BWS/1.1
# Set-Cookie: BAIDUID=D5BBF02F4454CBA7D3962001F33E17C6:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
# Set-Cookie: BIDUPSID=D5BBF02F4454CBA7D3962001F33E17C6; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
# Set-Cookie: PSTM=1590317610; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
# Set-Cookie: BAIDUID=D5BBF02F4454CBA7FDDF8A87AF5416A6:FG=1; max-age=31536000; expires=Mon, 24-May-21 10:53:30 GMT; domain=.baidu.com; path=/; version=1; comment=bd
# Set-Cookie: BDSVRTM=0; path=/
# Set-Cookie: BD_HOME=1; path=/
# Set-Cookie: H_PS_PSSID=31729_1436_21118_31592_31673_31464_31322_30824; path=/; domain=.baidu.com
# Traceid:1590317610038396263412927153817753433793
# Vary: Accept-Encoding
# Vary: Accept-Encoding
# X-Ua-Compatible: IE=Edge,chrome=1
# Connection: close
# Transfer-Encoding: chunked
# Get the status code of the current crawled webpage
print(file.getcode())
# 200
# Get the URL address currently crawled
print(file.geturl())
# 'http://www.baidu.com'
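Individual headers can also be read by name instead of printing everything; a minimal sketch using the same response object (the header names are just examples taken from the output above):
# info() returns a message object, so a single header can be fetched with get()
print(file.info().get("Content-Type"))
# text/html;charset=utf-8
# The response object also provides getheader() for the same purpose
print(file.getheader("Server"))
# BWS/1.1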
Generally speaking, the URL standard only allows some ASCII characters such as numbers, letters, and a few symbols; other characters, such as Chinese characters, do not conform to the standard. In that case the URL needs to be encoded.
import urllib.request

print(urllib.request.quote("http://www.baidu.com"))
# http%3A//www.baidu.com
print(urllib.request.unquote("http%3A//www.baidu.com"))
# http://www.baidu.com
To prevent others from maliciously collecting their information, some webpages have anti-crawler settings, and a 403 error appears when we try to crawl them.
We can set some Headers information to simulate browser access to these websites.
There are two ways to set the headers so that the crawler simulates a browser; both are shown below.
# Method 1: build an opener and set its addheaders attribute
import urllib.request

url = "http://www.baidu.com"
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()
fhandle = open("./2.html", "wb")
fhandle.write(data)
fhandle.close()
# Method 2: use add_header() on a Request object
import urllib.request

url = "http://www.baidu.com"
req = urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
data = urllib.request.urlopen(req).read()
fhandle = open("./2.html", "wb")
fhandle.write(data)
fhandle.close()
When visiting a webpage, if it does not respond for a long time, the system judges that the request has timed out, that is, the page cannot be opened.
import urllib.request

# timeout sets the timeout in seconds
file = urllib.request.urlopen("http://yum.iqianyue.com", timeout=1)
data = file.read()
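If the server does not answer within that time, urlopen() usually raises urllib.error.URLError; a minimal sketch of catching it, using the same test URL:
import urllib.request
import urllib.error

try:
    file = urllib.request.urlopen("http://yum.iqianyue.com", timeout=1)
    data = file.read()
    print(len(data))
except urllib.error.URLError as e:
    # A timeout typically surfaces here with a socket timeout as the reason
    print("Request failed or timed out:", e.reason)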
When a proxy server is used to crawl a website, the website sees the IP address of the proxy server instead of our real IP address. Even if that IP address is blocked, it does not matter, because we can switch to another proxy IP and continue crawling.
import urllib.request

def use_proxy(proxy_addr, url):
    # Route HTTP requests through the given proxy server
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode('utf-8')
    return data

proxy_addr = "xxx.xx.xxx.xx:xxxx"
data = use_proxy(proxy_addr, "http://www.baidu.com")
print(len(data))
urllib.request.install_opener() installs the opener object as the global default, so the installed opener is also used by later calls to urlopen().
Using only the HTTP protocol, after we log in to a website successfully, the login state disappears as soon as we visit other pages of the same site, and we would have to log in again. To avoid this, the corresponding session information, such as the fact that the login succeeded, needs to be saved in some way.
There are two commonly used methods:
1 ) Save session information through Cookie
2 ) Save session information through Session
However, no matter which method is used for session control, cookies are used most of the time.
A common procedure for cookie processing is as follows:
1 ) Import the cookie processing module http.cookiejar.
2 ) Use http.cookiejar.CookieJar() to create a CookieJar object.
3 ) Use HTTPCookieProcessor to create a cookie processor and use it as a parameter to construct an opener object.
4 ) Install the opener as the global default opener object.
import urllib.request
import urllib.parse
import http.cookiejar

url = "http://xx.xx.xx/1.html"
postdata = urllib.parse.urlencode({"username": "xxxxxx", "password": "xxxxxx"}).encode("utf-8")
req = urllib.request.Request(url, postdata)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
# Use http.cookiejar.CookieJar() to create a CookieJar object
cjar = http.cookiejar.CookieJar()
# Use HTTPCookieProcessor to create a cookie processor and use it as a parameter to construct an opener object
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
# Install the opener as the global default opener object
urllib.request.install_opener(opener)
# Log in; the cookies returned by the server are stored in cjar
file = opener.open(req)
data = file.read()
fhandle = open("./4.html", "wb")
fhandle.write(data)
fhandle.close()
# Visit another page; urlopen() now uses the installed opener, so the session cookies are sent automatically
url1 = "http://xx.xx.xx/2.html"
data1 = urllib.request.urlopen(url1).read()
fhandle1 = open("./5.html", "wb")
fhandle1.write(data1)
fhandle1.close()
Print the debug log while executing the program.
import urllib.request

# Turn on debug output for both the HTTP and HTTPS handlers
httphd = urllib.request.HTTPHandler(debuglevel=1)
httpshd = urllib.request.HTTPSHandler(debuglevel=1)
opener = urllib.request.build_opener(httphd, httpshd)
urllib.request.install_opener(opener)
data = urllib.request.urlopen("http://www.baidu.com")
import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://blog.baidusss.net")
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.reason)
except urllib.error.URLError as e:
    print(e.reason)
or
import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.URLError as e:
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
HTTP protocol requests mainly fall into 6 types, whose main functions are as follows:
1 ) GET request: passes information through the URL; you can write the information directly into the URL, or submit it with a form.
If a form is used, the form data is automatically converted into data in the URL address and passed through the URL.
2 ) POST request: submits data to the server; it is the more mainstream and relatively safer way to transfer data.
3 ) PUT request: asks the server to store a resource, usually at a specified location.
4 ) DELETE request: asks the server to delete a resource.
5 ) HEAD request: asks only for the corresponding HTTP header information.
6 ) OPTIONS request: gets the request methods supported by the current URL.
In addition, there are TRACE requests and CONNECT requests; TRACE requests are mainly used for testing or diagnosis. A sketch of sending a non-default request type follows.
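As a minimal sketch, the request type can be chosen explicitly through the method parameter of urllib.request.Request (available since Python 3.3); here it sends a HEAD request to the same test URL used earlier:
import urllib.request

# Ask only for the headers of the page, not the body
req = urllib.request.Request("http://www.baidu.com", method="HEAD")
file = urllib.request.urlopen(req)
print(file.getcode())
print(file.info())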
Using a GET request, the steps are as follows:
1 ) Construct a corresponding URL address, which contains information such as the field name and field content of the GET request.
GET request format: http://URL?field1=field content&field2=field content
2 ) Construct a Request object with the corresponding URL as a parameter.
3 ) Open the constructed Request object through urlopen().
4 ) Follow-up processing operations as required.
import urllib.request

url = "http://www.baidu.com/s?wd="
key = "Hello there"
# Encode the keyword so it is legal inside a URL
key_code = urllib.request.quote(key)
url_all = url + key_code
req = urllib.request.Request(url_all)
data = urllib.request.urlopen(req).read()
fh = open("./3.html", "wb")
fh.write(data)
fh.close()
To use a POST request, the steps are as follows:
1 ) Set the URL address.
2 ) Construct form data and use urllib.parse.urlencode to encode the data.
3 ) Create a Request object, the parameters include the URL address and the data to be passed.
4 ) Use add_header() to add header information to simulate browser crawling.
5 ) Use urllib.request.urlopen() to open the corresponding Request object to complete the transfer of information.
6 ) Follow-up processing.
import urllib.request
import urllib.parse

url = "http://www.xxxx.com/post/"
# Encode the form data into the format expected by the server
postdata = urllib.parse.urlencode({"name": "[email protected]", "pass": "xxxxxxx"}).encode('utf-8')
req = urllib.request.Request(url, postdata)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
data = urllib.request.urlopen(req).read()
fhandle = open("D:/Python35/myweb/part4/6.html", "wb")
fhandle.write(data)
fhandle.close()