The advantages and necessity of learning crawlers
A Python crawler simulates a browser opening a web page and extracts the data needed from that page.
Learning Python crawlers is not only fun, it also builds basic knowledge of the Python programming language, making it something of a shortcut into the IT industry: entertainment and learning in one. Like reading novels and funny pictures? Still screening company requirements one by one when job hunting? No reference data for operations or data analysis? Want to take on a small crawler task in your spare time to earn some "pocket money"? A crawler can help you get all of this done quickly.
Python crawlers are recognized as easy to learn, easy to use, and fun. This series of articles will contain theoretical knowledge, illustrated code, cases, and content summaries. If you want a deeper explanation of any knowledge point, you can leave a message in the comment area.
To learn Python crawlers, you need Python itself. Anaconda ships with its own Python interpreter and integrates many common Python libraries, so configuration and installation are very convenient. It is very suitable for getting started.
Official Anaconda download address: https://www.anaconda.com/products/individual
If you want to practice in your spare time and installing software at the office is inconvenient, two browser-based online Python 3 editors are recommended instead.
Now that the tools (our "weapons") are ready, it is time for the "secret technique": what is a crawler, how does a crawler crawl data, and what are the basic principles behind it?
A web crawler (also called a web spider) is a program or script that requests websites according to certain rules and automatically grabs data and information.
A request is initiated to the target site through an HTTP library; the request can carry additional headers and other information, and the client then waits for the server to respond.
When we open a website link, the client (a browser such as Chrome or Firefox) sends a request to the server (for example, the server hosting the Baidu website). The server receives the request, processes it, and returns a response to the client (the browser), which then displays the data you see on the page.
In the browser's developer tools, the Elements panel shows the web page's source code and lets you edit DOM nodes and CSS styles in real time. After the page request is initiated, the Network panel records the details of every resource fetched by HTTP requests. The parameter values shown in Network are the main content of our study.
Network related parameters are as follows:
A Request is what the browser sends when you enter a link address and click search (or press the Enter key).
The main request methods are GET and POST; there are also HEAD, PUT, DELETE, OPTIONS, and so on. Because crawlers almost always use GET and POST and the other methods are hardly ever involved, only a brief introduction is given here.
Example of a GET request:
import requests

# The url argument can be specified directly as the link address when making the request
requests.get(url='https://www.baidu.com/s?ie=utf-8&wd=Bald Girl 25')
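For comparison, below is a minimal POST sketch; the URL and form fields here are hypothetical placeholders, and the form data is passed through the data parameter of requests.post.

import requests

# A minimal POST sketch: the URL and form fields are hypothetical placeholders
form_data = {"username": "demo", "password": "123456"}
response = requests.post(url='https://example.com/login', data=form_data)
print(response.status_code)  # HTTP status code of the response, e.g. 200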
The requested URL is short for Uniform Resource Locator. A URL serves two main purposes: naming a resource and providing the path or location of the resource. Resources on a web page, such as documents, pictures, and videos, can all be specified by a URL.
The protocol: http:// or https://, followed by the link address of the server, for example:
http://baidu.com
When you search for Bald Girl 25 in the browser, the URL changes: in the link, s?ie=utf-8&wd=Bald Girl 25 specifies the utf-8 encoding (ie) and the search keyword Bald Girl 25 (wd).
Specify the URL address as follows:
url ="https://www.baidu.com/s?ie=utf-8&wd=Bald Girl 25"
The request headers are the header information sent with a request, such as User-Agent, Host, and Cookies. The request body is the additional data carried by the request, such as the form data submitted with a form. Many websites refuse access, or return garbled content, when a request arrives without headers. A simple workaround is to disguise the request as a browser, for example by adding a request header that imitates browser behavior.
Example of a request header:
# Define the request headers as a dictionary
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}
If the server can respond normally, a Response is obtained. The content of the Response is the content of the requested page; it may be HTML, a JSON string, binary data (such as pictures and videos), and so on.
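As a sketch of how these response types are read with requests: text returns decoded text such as HTML, json() parses a JSON string, and content returns the raw bytes of binary data. The JSON and image addresses below are only illustrative assumptions.

import requests

# .text decodes the body as text, e.g. the HTML source of a page
html_resp = requests.get('https://www.baidu.com')
print(html_resp.text)

# .json() parses a JSON string into a Python object (httpbin.org is a public test service)
json_resp = requests.get('https://httpbin.org/get')
print(json_resp.json())

# .content returns the raw bytes, suitable for binary data such as pictures or videos
img_resp = requests.get('https://example.com/picture.png')  # hypothetical image address
print(len(img_resp.content))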
Next, let us combine the request, the request header, and the response to complete a simple request-response exchange.
First, import the requests network-request module into Python (if it is not installed yet, install it from the terminal with pip install requests).
Find the request header for the access link in the Network panel and define it as a dictionary. Use the GET request method, passing in the link address and the request header, to get the response content. Printing response shows the access status, for example <Response [200]>, while response.text returns the entire text content of the page.
The code example is as follows:
import requests

# Define the request headers as a dictionary
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}

# Send the network request and receive the response
response = requests.get(url='https://www.baidu.com/s?ie=utf-8&wd=Bald Girl 25', headers=headers)

# print(response) outputs the return status; print(response.text) outputs the web page text
print(response.text)
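As a small follow-up sketch (a common workflow, not part of the original example), you can check the status code before using the content, and align the encoding if the text comes back garbled:

# Continuing from the response above: use the content only when the status is 200
if response.status_code == 200:
    # If response.text looks garbled, align the encoding with what requests detected
    response.encoding = response.apparent_encoding
    print(response.text)
else:
    print("Request failed with status:", response.status_code)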
You can run this code in a Python online editor to see the returned web page text.
To summarize the key terms: requests (initiating a request), response (obtaining a response), get (reading data, requesting the specified page information), post (submitting data to the server), url (uniform resource locator, specifying the documents, images, and videos of a web page), headers (the header information sent with a request).