Python: Understanding Crawlers and Anti-Crawlers

Reference material: "Python crawler, do you really understand it?" https://www.bilibili.com/read/cv4144658

Crawler: a program that collects information automatically in order to save manual labor. If it does not save labor, there is no reason to use one.

The ultimate goal of anti-crawling: tell computers apart from people, so that automated access is blocked while human access is allowed.

The final conclusion: both crawling and anti-crawling have an end point.

 The end point of crawling is full user simulation (browser automation).

 The end point of anti-crawling is a CAPTCHA that a machine cannot solve but a human can.

 So, to save trouble, why not learn just one technique, browser automation? That seems workable, although it is somewhat slow. A practical compromise is to obtain the key information with automation and then switch to concurrent requests.
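The "automation first, concurrent requests second" idea can be sketched roughly as follows. This is only a sketch under assumptions: it assumes Selenium 4 with Chrome is installed, and it assumes the "key information" is the session cookies. The function names (`get_cookie_header_with_browser`, `fetch`, `fetch_all`) are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def get_cookie_header_with_browser(url):
    """Open `url` in a real browser and return its cookies as a Cookie header value."""
    # Imported lazily so the rest of the module works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumes a recent Chrome
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        cookies = driver.get_cookies()  # list of dicts with "name" and "value"
    finally:
        driver.quit()
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def fetch(url, cookie_header, timeout=10):
    """Fetch one page with plain HTTP, reusing the cookies taken from the browser."""
    request = urllib.request.Request(url, headers={"Cookie": cookie_header})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, cookie_header, fetch_one=fetch, max_workers=8):
    """Fetch many URLs concurrently; results come back in the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: fetch_one(u, cookie_header), urls))
```

The slow browser step runs once, and the fast thread pool does the bulk of the work. `fetch_one` is a parameter so that the HTTP call can be swapped out, for example for a rate-limited version.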

Common anti-crawling measures:

1. Access frequency

If you send requests too frequently, the website may block your IP for a period of time. The principle is the same as anti-DDoS protection. For a crawler, limiting the request rate is enough to get around this.

Make the crawler visit pages at a human-like pace, for example by sleeping 5 to 10 seconds between requests.
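A minimal sketch of the pacing described above. The helper name `polite_crawl` and its parameters are made up for illustration; `fetch` stands for whatever function actually downloads a page.

```python
import random
import time


def polite_crawl(urls, fetch, min_delay=5.0, max_delay=10.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomized pause so the request pattern is not perfectly regular.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

The delay is drawn at random between the two bounds rather than fixed, since a perfectly regular interval is itself a machine-like pattern. The `sleep` parameter exists only so that the pause can be replaced in tests.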

2. Login restrictions

Websites that publish information openly generally do not impose this restriction, because it inconveniences users. Where it does exist, it can be bypassed by simulating a login or by attaching the Cookie of a logged-in session to each request.
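Attaching a Cookie might look like the sketch below, which uses only the standard library. The URL, the cookie name `sessionid`, and its value are hypothetical placeholders; the real values would be copied from a logged-in browser session.

```python
import urllib.request


def build_logged_in_request(url, cookies):
    """Return a Request that carries `cookies` (a dict) in its Cookie header."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(url, headers={"Cookie": cookie_header})


# Hypothetical values for illustration only.
request = build_logged_in_request("https://example.com/profile", {"sessionid": "abc123"})
# urllib.request.urlopen(request) would then send the request with the cookie attached.
```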

3. Blocking by request headers

Just send browser-like request headers; header values such as User-Agent can be randomly generated with faker.
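A sketch of building randomized headers. It assumes the Python `Faker` package when available (its `user_agent()` method returns a random User-Agent string) and otherwise falls back to a short hand-written list. The function name `random_headers` and the example User-Agent strings are illustrative only.

```python
import random

# Small fallback list; these User-Agent strings are only examples.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers():
    """Build request headers with a randomly chosen User-Agent."""
    try:
        from faker import Faker  # optional dependency
        user_agent = Faker().user_agent()
    except ImportError:
        user_agent = random.choice(FALLBACK_USER_AGENTS)
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
```

The returned dict can be passed as the `headers` argument of `urllib.request.Request` or of whichever HTTP client the crawler uses.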

4. Data loaded dynamically by JavaScript (advanced)

The content of some websites (especially single-page websites) is not returned directly by the server. Instead, the server returns only a client-side JS program, and the JS then fetches the content. In a more advanced variant, the JS computes a token locally and uses that token in the AJAX request that fetches the content. The local JS is obfuscated and encrypted, which makes the request harder to analyze.

However, this is easily defeated by driving a real browser directly.
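Driving a browser is the general answer. When the token logic happens to be simple enough to read out of the script, the alternative is to recompute the token in Python and call the data endpoint directly. The sketch below assumes an entirely hypothetical scheme (an MD5 digest over path, timestamp and a secret, sent as `ts` and `token` query parameters); a real site's algorithm and parameter names have to be worked out from its own JS.

```python
import hashlib
import time
import urllib.parse


def make_token(path, timestamp, secret):
    """Recompute the (hypothetical) token the page's JS attaches to its AJAX call."""
    raw = f"{path}|{timestamp}|{secret}".encode("utf-8")
    return hashlib.md5(raw).hexdigest()


def build_ajax_url(base_url, path, secret, timestamp=None):
    """Return the data URL with the timestamp and token query parameters filled in."""
    if timestamp is None:
        timestamp = int(time.time())
    query = urllib.parse.urlencode(
        {"ts": timestamp, "token": make_token(path, timestamp, secret)}
    )
    return f"{base_url}{path}?{query}"
```

This route is faster than a browser but breaks as soon as the site changes its script, which is why browser simulation remains the more robust option.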

5. CAPTCHA (the ultimate weapon)

A CAPTCHA is a mechanism designed specifically to tell people and computers apart. To get past this measure, a crawler has to be able to solve the CAPTCHA. A common one, Google's reCAPTCHA, is extremely hard to defeat.

6. IP restriction

An IP address that the website identifies as a crawler may be blocked permanently. This method requires a lot of manual effort, and the cost of blocking legitimate users by mistake is high. The workaround is to use a proxy pool.
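A proxy pool can be as simple as a rotating list from which failing proxies are dropped. The sketch below uses only the standard library. The class `ProxyPool` and the function `fetch_via_pool` are made up for illustration, and the proxy addresses in the usage note are placeholders.

```python
import urllib.request


class ProxyPool:
    """Rotating pool of proxy URLs that drops proxies reported as bad."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._next = 0

    def get(self):
        """Return the next proxy in rotation."""
        if not self._proxies:
            raise RuntimeError("proxy pool is empty")
        proxy = self._proxies[self._next % len(self._proxies)]
        self._next += 1
        return proxy

    def discard(self, proxy):
        """Remove a proxy that failed or was blocked."""
        if proxy in self._proxies:
            self._proxies.remove(proxy)


def fetch_via_pool(url, pool, retries=3, timeout=10):
    """Try the URL through successive proxies, discarding those that fail."""
    last_error = None
    for _ in range(retries):
        proxy = pool.get()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
        try:
            with opener.open(url, timeout=timeout) as response:
                return response.read()
        except OSError as error:  # URLError and socket timeouts are OSError subclasses
            last_error = error
            pool.discard(proxy)
    raise RuntimeError(f"all attempts failed for {url}") from last_error
```

Usage would look like `fetch_via_pool(url, ProxyPool(["http://203.0.113.1:8080", "http://203.0.113.2:8080"]))`, with the list filled from a real proxy source.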

7. Anti-crawling through content presentation

Some websites present their content in a form that only humans can read, such as rendering it as images. Such images can be read with OCR. For example, the data returned by a single request may be only part of the encoded image, and the complete image can be obtained only after the results from multiple URLs are combined.
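Reassembling a split image and passing it to OCR might be sketched as follows. The sketch assumes the fragments are consecutive pieces of one base64 string, which is only one possible encoding, and it assumes Pillow, `pytesseract` and the Tesseract binary are installed for the OCR step. Both function names are made up for illustration.

```python
import base64


def combine_image_fragments(fragments):
    """Join base64 text fragments in order and decode them into image bytes."""
    return base64.b64decode("".join(fragments))


def ocr_image_bytes(image_bytes):
    """Run OCR over the image and return the recognized text."""
    # Imported lazily: these are third-party packages, not the standard library.
    import io
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))
```

The fragments must be joined in the order the page itself uses, so the crawler needs to know which URL supplies which piece.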
