With the advent of the big data era, data is becoming one of our most important resources, much like coal, electricity, and oil. Unlike those, however, data can be produced continuously and is renewable. As a key means of obtaining data, Python crawlers play an extremely important role in the big data era, so many students ask: are Python crawlers easy to learn?
**What is a crawler?**
Web crawlers, also known as web spiders or web robots, are programs or scripts that automatically crawl information from the World Wide Web according to certain rules.
**Where does the data come from?**
If you want to learn Python crawling, first ask: where does the data we crawl come from?
- User data generated by enterprises: Baidu Index, Alibaba Index, TBI Tencent Browsing Index, Sina Weibo Index;
- Data purchased from data platforms: Datatang, Guoyun Data Market, Guiyang Big Data Exchange;
- Public data from governments and institutions: the National Bureau of Statistics of the People's Republic of China, the World Bank, the United Nations, Nasdaq;
- Data management and consulting companies: McKinsey, Accenture, iResearch;
- Data crawled from the web: if the data you need is not available on the market, or you are unwilling to buy it, you can hire a crawler engineer, or become one, and collect it yourself.
**How to grab page data?**
Three characteristics of web pages:

- Each web page has a unique URL (Uniform Resource Locator) that locates it;
- Web pages use HTML (HyperText Markup Language) to describe the page content;
- Web pages use the HTTP/HTTPS (HyperText Transfer Protocol) protocols to transmit HTML data.
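The first and third characteristics can be read straight off a URL itself. A minimal sketch using Python's standard library (the URL below is just an illustrative placeholder, not a site from this article):

```python
from urllib.parse import urlparse

# A URL uniquely locates a web page: scheme, host, and path
parts = urlparse("https://www.example.com/video/index.html")

print(parts.scheme)  # https -> the page travels over HTTP/HTTPS
print(parts.netloc)  # www.example.com -> the server that hosts it
print(parts.path)    # /video/index.html -> the HTML document requested
```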
Crawler design ideas:
1. Determine the URL address of the web page to be crawled.
2. Obtain the corresponding HTML page via the HTTP/HTTPS protocol.
3. Extract useful data from the HTML page:
   a. If it is the required data, save it.
   b. If it is another URL on the page, return to step 2 with that URL.
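The steps above form a simple fetch / extract / follow loop. Below is a minimal sketch of that loop; the `fetch` callable is injected as a parameter so the logic can run without a network, and the regex and URLs are illustrative assumptions, not from the original article:

```python
import re
from collections import deque

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first sketch of the fetch / extract / follow loop.

    `fetch` is any callable mapping a URL to its HTML text.
    """
    queue = deque([start_url])   # step 1: URL(s) waiting to be crawled
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)        # step 2: obtain the HTML page
        pages[url] = html        # step 3a: save the useful data
        # step 3b: any other URL found on the page goes back into the queue
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Against the real network, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read().decode('utf-8')`.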
Conclusion: Python crawling is actually an entry-level part of the Python learning process. It is not difficult to learn, yet it is one of the indispensable professional skills.
**Content expansion:**
A simple crawler example:
```python
import re
import urllib.request

def geturllist():
    # Build a Request object instead of fetching the URL directly, so
    # headers can be attached to simulate a browser visiting the server
    req = urllib.request.Request("http://www.budejie.com/video/")
    # Add a User-Agent header so the server believes a browser is making
    # the request (the value is copied from a real browser)
    req.add_header('User-Agent',
                   'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/71.0.3578.98 Safari/537.36')
    # Open the request object just created and read the page source
    res = urllib.request.urlopen(req)
    html = res.read().decode('utf-8')
    print(html)  # page source
    # Regular expression that captures each video URL from data-mp4="..."
    reg = r'data-mp4="(.*?)"'
    # Find all video URLs in the page source
    urllist = re.findall(reg, html)
    # Download the videos one by one; url.split('/')[-1] takes the last
    # '/'-separated segment of the URL as the file name
    for url in urllist:
        urllib.request.urlretrieve(url, '%s.mp4' % url.split('/')[-1])

geturllist()
```