Reference material: "Python crawlers, do you really understand them?": https://www.bilibili.com/read/cv4144658
Crawler: use a computer to collect information so as to save labor; if it doesn't save labor, there is no point in using one.
The ultimate anti-crawler measure: distinguish computers from humans, so that machine access is blocked while human access is allowed.
The final conclusion: both crawling and anti-crawling have an endgame.
The endgame for crawlers is full user simulation (browser automation).
The endgame for anti-crawling is a CAPTCHA that machines cannot solve but humans can.
So, to save trouble, why not just learn browser automation and be done with it? That does work, although it is a bit slow. A common compromise: use automation to obtain the key information (cookies, tokens), then switch to concurrent requests, as sketched below.
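A minimal sketch of that hybrid approach, assuming Selenium with Chrome and the requests library; the login URL and page URLs are placeholders:

```python
# Hybrid approach: let a real browser obtain the session, then fetch pages concurrently.
# Sketch only; example.com and the URL list are placeholders.
from concurrent.futures import ThreadPoolExecutor

import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/login")
# ... perform the slow, automated steps here (login, token generation) ...

# Copy the browser's cookies into a plain requests session.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])
driver.quit()

urls = [f"https://example.com/page/{i}" for i in range(1, 11)]

def fetch(url):
    return session.get(url, timeout=10).text

# Fire the cheap HTTP requests concurrently.
with ThreadPoolExecutor(max_workers=5) as pool:
    pages = list(pool.map(fetch, urls))
```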
Common anti-crawler measures:
1、 Visit frequency
If you access a site too frequently, it may block your IP for a period of time; the principle is the same as anti-DDoS protection. For such sites, it is enough for the crawler to limit its request rate.
Make the crawler visit pages at a human-like pace, for example sleeping a random 5 to 10 seconds between requests (see the sketch below).
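A minimal sketch of such throttling with the requests library; the URL list is a placeholder:

```python
# Throttle requests to a human-like pace: sleep a random 5-10 seconds between pages.
# Sketch only; the URL list is a placeholder.
import random
import time

import requests

urls = ["https://example.com/page/%d" % i for i in range(1, 6)]

for url in urls:
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code)
    time.sleep(random.uniform(5, 10))  # wait 5-10 s before the next request
```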
2、 Login restrictions
Sites whose purpose is to publish information generally do not impose this restriction, because it inconveniences users. Where it exists, it can be bypassed by simulating a login and carrying the resulting Cookie on later requests, as in the sketch below.
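A minimal sketch of a simulated login with a requests session; the login URL, form field names, and credentials are placeholders:

```python
# Simulated login: post credentials once, then reuse the session cookie.
# Sketch only; the login URL, field names, and credentials are placeholders.
import requests

session = requests.Session()
session.post(
    "https://example.com/login",
    data={"username": "alice", "password": "secret"},
    timeout=10,
)
# The Set-Cookie from the login response is stored in the session automatically,
# so later requests are made as a logged-in user.
resp = session.get("https://example.com/members-only", timeout=10)
print(resp.status_code)
```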
3、 Blocking based on request headers
Just send realistic request headers; a random User-Agent can be generated with the faker library (sketch below).
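A minimal sketch assuming the faker and requests libraries; the target URL and Referer are placeholders:

```python
# Send a randomized User-Agent on every request using the faker library.
# Sketch only; the target URL and Referer are placeholders.
import requests
from faker import Faker

fake = Faker()

headers = {
    "User-Agent": fake.user_agent(),   # a random browser UA string each time
    "Referer": "https://example.com/",
}
resp = requests.get("https://example.com/data", headers=headers, timeout=10)
print(resp.status_code)
```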
4、 JavaScript script to dynamically obtain website data (upgrade)
The content of some websites (especially single-page websites) is not directly returned by the server, but the server only returns a client JS program, and then JS gets the content. More Advanced
The thing is, js calculates a token locally, and then uses this token for ajax to go to the content. The local js is code obfuscated and encrypted, which will increase the difficulty of parsing the request.
However, it can be easily cracked by directly simulating the browser operation.
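A minimal sketch using Selenium with headless Chrome; the URL and the fixed wait are placeholders:

```python
# Let a real (headless) browser execute the site's JavaScript, then read the rendered page.
# Sketch only; the URL and the fixed wait are placeholders.
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")   # run without a visible window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/spa-page")
time.sleep(5)                        # crude wait for the AJAX calls to finish;
                                     # a WebDriverWait on a known element is better

html = driver.page_source            # the DOM after the client-side JS has run
driver.quit()
print(len(html))
```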
5、 Verification Code (Ultimate Weapon)
CAPTCHAs exist specifically to distinguish humans from computers. To get past this kind of anti-crawling, the crawler must be able to solve the CAPTCHA. Common CAPTCHAs vary in difficulty; Google's reCAPTCHA in particular is extremely hard to crack.
6、 IP restrictions
The website may permanently block any IP it identifies as a crawler. This approach requires a lot of manual effort, and the cost of mistakenly blocking real users is high. The workaround is a proxy pool, as in the sketch below.
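A minimal sketch of rotating through a proxy pool with requests; the proxy addresses and target URL are placeholders:

```python
# Rotate requests through a pool of proxies so no single IP gets banned.
# Sketch only; the proxy addresses and target URL are placeholders.
import random

import requests

PROXY_POOL = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

def fetch(url):
    proxy = random.choice(PROXY_POOL)
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )

resp = fetch("https://example.com/data")
print(resp.status_code)
```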
7、 Obfuscating the website content itself
Some websites present content in a form that only humans can easily read, such as rendering text as images; these can be handled with OCR. For example, a single request may return
only part of the encoded picture, and the complete picture is only obtained after combining the responses of several URLs (see the sketch below).
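A minimal sketch assuming Pillow and pytesseract, and assuming the pieces are simply concatenated bytes of one image file; the chunk URLs are placeholders:

```python
# Reassemble an image delivered in pieces, then run OCR on it.
# Sketch only; the chunk URLs are placeholders, and it assumes the pieces are
# concatenated bytes of a single image file. Requires Pillow and pytesseract.
from io import BytesIO

import pytesseract
import requests
from PIL import Image

chunk_urls = [
    "https://example.com/img/part1",
    "https://example.com/img/part2",
    "https://example.com/img/part3",
]

# Download every piece and concatenate the raw bytes into the full image.
data = b"".join(requests.get(u, timeout=10).content for u in chunk_urls)

image = Image.open(BytesIO(data))
text = pytesseract.image_to_string(image)   # OCR the rendered text
print(text)
```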