Reference material: "Python crawlers, do you really understand them?": https://www.bilibili.com/read/cv4144658
Crawler: use a computer to collect information so as to save labor; if it doesn't save labor, there is no point in using one.
The ultimate anti-crawler measure: distinguish computers from humans, so that machine access is blocked while human access is allowed.
The final conclusion: both crawling and anti-crawling have an endgame.
The endgame for crawlers is full user simulation (browser automation).
The endgame for anti-crawling is a CAPTCHA that machines cannot solve but humans can.
So, to save trouble, why not just learn browser automation and be done with it? That does work, although it is a bit slow. A common compromise: use automation to obtain the key information (cookies, tokens), then switch to concurrent requests, as sketched below.
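A minimal sketch of that hybrid approach, assuming Selenium with Chrome and the requests library; the login URL and page URLs are placeholders:

```python
# Hybrid approach: let a real browser obtain the session, then fetch pages concurrently.
# Sketch only; example.com and the URL list are placeholders.
from concurrent.futures import ThreadPoolExecutor

import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/login")
# ... perform the slow, automated steps here (login, token generation) ...

# Copy the browser's cookies into a plain requests session.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])
driver.quit()

urls = [f"https://example.com/page/{i}" for i in range(1, 11)]

def fetch(url):
    return session.get(url, timeout=10).text

# Fire the cheap HTTP requests concurrently.
with ThreadPoolExecutor(max_workers=5) as pool:
    pages = list(pool.map(fetch, urls))
```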
Common anti-crawler measures:
1、 Visit frequency
If you access a site too frequently, it may block your IP for a period of time; the principle is the same as anti-DDoS protection. For such sites, it is enough for the crawler to limit its request rate.
Make the crawler visit pages at a human-like pace, for example sleeping a random 5 to 10 seconds between requests (see the sketch below).
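A minimal sketch of such throttling with the requests library; the URL list is a placeholder:

```python
# Throttle requests to a human-like pace: sleep a random 5-10 seconds between pages.
# Sketch only; the URL list is a placeholder.
import random
import time

import requests

urls = ["https://example.com/page/%d" % i for i in range(1, 6)]

for url in urls:
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code)
    time.sleep(random.uniform(5, 10))  # wait 5-10 s before the next request
```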
2、 Login restrictions
Sites whose purpose is to publish information generally do not impose this restriction, because it inconveniences users. Where it exists, it can be bypassed by simulating a login and carrying the resulting Cookie on later requests, as in the sketch below.
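A minimal sketch of a simulated login with a requests session; the login URL, form field names, and credentials are placeholders:

```python
# Simulated login: post credentials once, then reuse the session cookie.
# Sketch only; the login URL, field names, and credentials are placeholders.
import requests

session = requests.Session()
session.post(
    "https://example.com/login",
    data={"username": "alice", "password": "secret"},
    timeout=10,
)
# The Set-Cookie from the login response is stored in the session automatically,
# so later requests are made as a logged-in user.
resp = session.get("https://example.com/members-only", timeout=10)
print(resp.status_code)
```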
3、 Blocking based on request headers
Just send realistic request headers; a random User-Agent can be generated with the faker library (sketch below).
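A minimal sketch assuming the faker and requests libraries; the target URL and Referer are placeholders:

```python
# Send a randomized User-Agent on every request using the faker library.
# Sketch only; the target URL and Referer are placeholders.
import requests
from faker import Faker

fake = Faker()

headers = {
    "User-Agent": fake.user_agent(),   # a random browser UA string each time
    "Referer": "https://example.com/",
}
resp = requests.get("https://example.com/data", headers=headers, timeout=10)
print(resp.status_code)
```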
4、 JavaScript script to dynamically obtain website data (upgrade)
The content of some websites (especially single-page websites) is not directly returned by the server, but the server only returns a client JS program, and then JS gets the content. More Advanced
The thing is, js calculates a token locally, and then uses this token for ajax to go to the content. The local js is code obfuscated and encrypted, which will increase the difficulty of parsing the request.
However, it can be easily cracked by directly simulating the browser operation.
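A minimal sketch using Selenium with headless Chrome; the URL and the fixed wait are placeholders:

```python
# Let a real (headless) browser execute the site's JavaScript, then read the rendered page.
# Sketch only; the URL and the fixed wait are placeholders.
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")   # run without a visible window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/spa-page")
time.sleep(5)                        # crude wait for the AJAX calls to finish;
                                     # a WebDriverWait on a known element is better

html = driver.page_source            # the DOM after the client-side JS has run
driver.quit()
print(len(html))
```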
5、 Verification Code (Ultimate Weapon)
CAPTCHAs exist specifically to distinguish humans from computers. To get past this kind of anti-crawling, the crawler must be able to solve the CAPTCHA. Common CAPTCHAs vary in difficulty; Google's reCAPTCHA in particular is extremely hard to crack.
6、 IP restrictions
The website may permanently block any IP it identifies as a crawler. This approach requires a lot of manual effort, and the cost of mistakenly blocking real users is high. The workaround is a proxy pool, as in the sketch below.
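A minimal sketch of rotating through a proxy pool with requests; the proxy addresses and target URL are placeholders:

```python
# Rotate requests through a pool of proxies so no single IP gets banned.
# Sketch only; the proxy addresses and target URL are placeholders.
import random

import requests

PROXY_POOL = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

def fetch(url):
    proxy = random.choice(PROXY_POOL)
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )

resp = fetch("https://example.com/data")
print(resp.status_code)
```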
7、 Obfuscating the website content itself
Some websites present content in a form that only humans can easily read, such as rendering text as images; these can be handled with OCR. For example, a single request may return
only part of the encoded picture, and the complete picture is only obtained after combining the responses of several URLs (see the sketch below).
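A minimal sketch assuming Pillow and pytesseract, and assuming the pieces are simply concatenated bytes of one image file; the chunk URLs are placeholders:

```python
# Reassemble an image delivered in pieces, then run OCR on it.
# Sketch only; the chunk URLs are placeholders, and it assumes the pieces are
# concatenated bytes of a single image file. Requires Pillow and pytesseract.
from io import BytesIO

import pytesseract
import requests
from PIL import Image

chunk_urls = [
    "https://example.com/img/part1",
    "https://example.com/img/part2",
    "https://example.com/img/part3",
]

# Download every piece and concatenate the raw bytes into the full image.
data = b"".join(requests.get(u, timeout=10).content for u in chunk_urls)

image = Image.open(BytesIO(data))
text = pytesseract.image_to_string(image)   # OCR the rendered text
print(text)
```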