A summary of basic Python crawler knowledge points

**First of all, what is a crawler?**

A web crawler (also known as a web spider or web robot, and, in the FOAF community, more often a web chaser) is a program or script that automatically crawls information on the World Wide Web according to certain rules.

According to my experience, to learn Python crawler, we have to learn the following points:

  1. Python basic learning

First of all, since we will write our crawlers in Python, we must understand the basics of the language: a tall building needs a solid foundation, and you can't skip it. Below I share some Python tutorials that I have read; friends can use them as a reference.

  1. Python learning network

There are a lot of free introductory Python tutorials online for everyone to learn from. There are not only video tutorials but also corresponding Q&A sections to help you solve problems that come up while learning, and the effect is quite good. The content is basically the most fundamental material; if you are just getting started, this is the place.

  1. Liao Xuefeng Python Tutorial

Later, I found Teacher Liao Xuefeng's Python tutorial. It is very easy to understand and feels very well done. If you want to learn more about Python, take a look at this one.

  1. Concise Python tutorial

There is also the Concise Python Tutorial (a Chinese translation of *A Byte of Python*), which I have read and which also feels good.

Learning URL: Concise Python tutorial (https://woodpecker.org.cn/abyteofpython_cn/chinese/pr01.html#s01)

  1. Wang Hai's laboratory

This is written by a senior from my undergraduate laboratory. I referred to his articles when I was getting started and made my own summary; later, this series of articles added some content on top of his.

Learning website: Wang Hai’s laboratory (https://blog.csdn.net/wxg694175346/category_1418998_1.html)

  1. Usage of Python urllib and urllib2 library

The urllib and urllib2 libraries are the most basic libraries for learning Python crawlers. With them we can fetch the content of a web page, then extract and analyze that content with regular expressions to get the results we want. I will share this with you during the learning process.
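As a minimal sketch of the fetch step: note that urllib2 exists only in Python 2; in Python 3 its functionality was merged into urllib.request, which is what the sketch below uses. A throwaway local HTTP server stands in for a real website so the example is self-contained.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class DemoHandler(BaseHTTPRequestHandler):
    """Serves a tiny fixed page, standing in for a real site."""
    def do_GET(self):
        body = b"<html><title>demo</title><body>hello crawler</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The actual crawler part: fetch the page content by URL.
url = "http://127.0.0.1:%d/" % server.server_port
with urllib.request.urlopen(url) as resp:
    html = resp.read().decode("utf-8")

print(html)  # raw page content, ready for regex extraction
server.shutdown()
```

In Python 2 the equivalent call would be `urllib2.urlopen(url).read()`; everything downstream (regex extraction, parsing) is the same.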

  1. Python regular expression

Python regular expressions are a powerful weapon for matching strings. The design idea is to use a descriptive language to define a rule for strings: any string that satisfies the rule is considered a "match"; otherwise, the string does not match. This will be covered in a later blog post.
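A small sketch of that idea applied to crawling: use the `re` module to define a rule for link tags, and pull out every substring of the page that satisfies it (the HTML snippet and URLs here are made up for illustration).

```python
import re

# A fragment of fetched HTML (hypothetical example data).
html = '''
<a href="https://example.com/page1">Page One</a>
<a href="https://example.com/page2">Page Two</a>
'''

# The rule: an anchor tag with an href and link text.
# Strings matching the rule are captured; everything else is ignored.
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')
links = pattern.findall(html)  # list of (url, text) tuples

for url, text in links:
    print(url, "->", text)
```

Note that regexes work well for small, predictable extractions like this; for messy real-world HTML, a proper parser is usually more robust.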

  1. Scrapy

If you are proficient in Python and have mastered basic crawler knowledge, it is time to look at a crawler framework. The framework I chose is Scrapy. What are its powerful features? Here is its official introduction:

  * Built-in support for selecting and extracting data from HTML and XML sources
  * A series of reusable filters (i.e. Item Loaders) shared between spiders, with built-in support for intelligent processing of crawled data
  * Built-in support for exporting to multiple formats (JSON, CSV, XML) and multiple storage backends (FTP, S3, local file system) via feed exports
  * A media pipeline that can automatically download images (or other resources) found in the crawled data
  * High extensibility: you can plug in your own functionality using signals and well-designed APIs (middleware, extensions, pipelines)
  * Built-in middleware and extensions that support:
    * cookies and session handling
    * HTTP compression
    * HTTP authentication
    * HTTP caching
    * user-agent spoofing
    * robots.txt
    * crawl depth limits
  * Automatic detection of, and robust support for, non-standard or incorrect encoding declarations in non-English languages
  * Template-based spider generation, which speeds up spider creation while keeping the code in large projects more consistent (see the genspider command for details)
  * An extensible stats collection facility for performance evaluation and failure detection across multiple spiders
  * An interactive shell that makes it very convenient to test XPath expressions and to write and debug spiders
  * A system service to simplify deployment and operation in production
  * A built-in web service for monitoring and controlling your crawler
  * A built-in Telnet console that hooks into the Python interpreter inside the Scrapy process, so you can inspect and debug your crawler
  * Logging support, making it easy to catch errors during crawling
  * Sitemap crawling support
  * A DNS resolver with caching

Official document: http://doc.scrapy.org/en/latest/

Once we have mastered the basic knowledge, let's put the Scrapy framework to use!

After rambling on for so long, it seems there hasn't been much of real substance yet, so let's stop rambling!

Knowledge point expansion:

Basic principles of crawlers

A crawler is a program that simulates a user's operations in a browser or app, automating the process of fetching data.

When we enter a URL in the browser and press Enter, what happens behind the scenes? For example, suppose you enter https://www.baidu.com.

Simply put, the following four steps occurred in this process:

  1. The browser queries the DNS (Domain Name System), whose main job is to convert the domain name into the corresponding IP address.
  2. The browser sends a request to the server at that IP address.
  3. The server responds to the request and sends back the content of the web page.
  4. The browser renders and displays the content.
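The four steps above can be sketched by hand with the standard library: resolve a name via DNS, open a connection and send a raw HTTP request, read the response, then "display" (here, just print) the body. A throwaway local server plays the role of the remote site so the sketch runs offline.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class PageHandler(BaseHTTPRequestHandler):
    """Stand-in for the remote web server."""
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"<html>hello</html>")

    def log_message(self, *args):
        pass

server = HTTPServer(("127.0.0.1", 0), PageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Step 1: DNS lookup turns a host name into an IP address.
ip = socket.gethostbyname("localhost")  # typically 127.0.0.1

# Step 2: connect to the server and send a raw HTTP request.
conn = socket.create_connection(("127.0.0.1", server.server_port))
conn.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")

# Step 3: the server sends back headers followed by the page content.
response = b""
while True:
    chunk = conn.recv(4096)
    if not chunk:
        break
    response += chunk
conn.close()
server.shutdown()

# Step 4: a browser would render this; we just print the body,
# which starts after the blank line that ends the headers.
body = response.split(b"\r\n\r\n", 1)[1].decode()
print(body)
```

A real crawler rarely works at this level; libraries like urllib handle steps 1-3 in one call, but seeing the raw exchange makes it clear what they do for you.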

What a web crawler needs to do, simply put, is implement the browser's function: given a URL, it returns the data the user needs directly, without the user having to operate a browser step by step to obtain it.

This concludes this article on basic Python crawler knowledge.
