Welcome to follow "Shengxin Practice Manual"!
A web crawler is a computer program that automatically downloads data from a website and formats it. In recent years, web crawler engineer has been a popular job title. As an all-round language, Python is well suited to crawler development.
To develop a web crawler, we need the following foundations:
1. Web content download
The first task of a crawler is to fetch data from a website. In Python, the commonly used modules are as follows:
- urllib
- requests
- selenium
urllib is a built-in module that provides basic download functions. requests is a third-party module that provides a more convenient interface. selenium is a module for automated browser testing and is suitable for crawling dynamic web pages.
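As a minimal sketch of the download step, the function below uses only the built-in urllib. The function name `fetch_html`, the default user agent string, and the 10-second timeout are assumptions made for illustration, not part of any particular crawler.

```python
from urllib.request import Request, urlopen


def fetch_html(url, user_agent="Mozilla/5.0 (example crawler)", timeout=10):
    """Download a page and return its body as text.

    The default user agent and timeout are illustrative assumptions.
    """
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=timeout) as resp:
        # Use the charset declared in the Content-Type header, else assume UTF-8.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling `fetch_html("https://example.com")` would return the page source as a string. With the third-party requests module, the equivalent call is roughly `requests.get(url).text`.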
2. HTML content cleaning
We usually need only part of the content of a web page, so after downloading we have to clean the data and extract the information we need from the raw page. The commonly used extraction techniques are as follows:
- Regular expressions
- XPath expressions
In practice, data can also be extracted with third-party modules such as BeautifulSoup.
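The two extraction techniques can be sketched with the standard library alone. The sample page and both function names are hypothetical. Note that `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup, which real web pages often are not, so in practice a dedicated HTML parser is the more likely choice.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for illustration.
SAMPLE = """<html><body>
<div class="post"><a href="/a.html">First</a></div>
<div class="post"><a href="/b.html">Second</a></div>
<div class="ad"><a href="/x.html">Sponsored</a></div>
</body></html>"""


def links_by_regex(html):
    """Return every href value matched by a simple regular expression."""
    return re.findall(r'<a\s+href="([^"]+)"', html)


def post_titles_by_xpath(html):
    """Return the link text inside <div class="post"> via an XPath expression."""
    root = ET.fromstring(html)
    return [a.text for a in root.findall(".//div[@class='post']/a")]
```

The regular expression grabs every link, including the advertisement, while the XPath expression can select by document structure and keep only the links inside the `post` blocks.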
3. Storing content in a database
For large amounts of data, you can store the extracted data in a database to make retrieval more efficient. This requires using Python to communicate with the database. The commonly used databases are as follows:
- SQLite
- MySQL
- MongoDB
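As a small sketch of the storage step, the example below uses SQLite through Python's built-in `sqlite3` module. The table name `pages`, its two columns, and the helper names are assumptions chosen for illustration.

```python
import sqlite3


def save_pages(conn, rows):
    """Insert (url, title) pairs, skipping URLs that are already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    # Parameter placeholders (?) keep scraped text from being run as SQL.
    conn.executemany(
        "INSERT OR IGNORE INTO pages (url, title) VALUES (?, ?)", rows
    )
    conn.commit()


def find_title(conn, url):
    """Look up a stored title by URL; returns None when the URL is unknown."""
    row = conn.execute(
        "SELECT title FROM pages WHERE url = ?", (url,)
    ).fetchone()
    return row[0] if row else None
```

A connection can be opened with `sqlite3.connect("crawl.db")` for a file on disk (the file name here is hypothetical) or `sqlite3.connect(":memory:")` for a throwaway database. Making the URL the primary key means a page crawled twice is stored only once.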
In real-world development, dealing with a website's anti-crawler mechanisms requires more skills, such as setting the user agent, using IP proxies, logging in to accounts with cookies, and packet-capture analysis of web traffic. The following is a summary of the back-and-forth contest between crawlers and anti-crawler mechanisms.
It also clearly lays out a learning path for crawler development. In subsequent chapters, I will cover the relevant content following this map.
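Three of the skills mentioned above (user agent, IP proxy, and cookies) can be combined in a single urllib opener. This is only a sketch: the helper name `make_opener` and any proxy address passed to it are assumptions, and real anti-crawler countermeasures usually need more than this.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


def make_opener(user_agent, proxy_url=None):
    """Build an opener with a custom User-Agent, a cookie jar, and an optional proxy."""
    jar = CookieJar()
    # The cookie processor stores cookies set by the server and sends them back.
    handlers = [HTTPCookieProcessor(jar)]
    if proxy_url:
        # Route both HTTP and HTTPS traffic through the given proxy.
        handlers.append(ProxyHandler({"http": proxy_url, "https": proxy_url}))
    opener = build_opener(*handlers)
    # Replace urllib's default "Python-urllib/x.y" identification.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar
```

Requests made with `opener.open(url)` then share the same cookie jar, so a session established by a login request carries over to later page requests.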