The interface for fetching web pages
Compared with static programming languages such as Java, C#, and C++, Python's interface for fetching web documents is more concise; compared with other dynamic scripting languages such as Perl and shell, Python's urllib2 package provides a more complete API for accessing web documents. (Of course, Ruby is also a good choice.)
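As a small illustration of that conciseness, here is a minimal sketch of fetching a document with urllib2 (Python 2; the URL is a placeholder, not one from this article):

# A minimal sketch (Python 2): fetch a web document with urllib2.
# The URL below is a placeholder, not taken from the article.
import urllib2

response = urllib2.urlopen('http://example.com/')
html = response.read()  # the whole document as a string
response.close()
print len(html)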
In addition, crawling web pages sometimes requires simulating browser behavior, since many websites block blunt crawling. In those cases you need to act like a user agent and construct appropriate requests: simulate user login, and simulate storing and setting sessions and cookies. There are excellent third-party packages in Python to help you do this, such as Requests and mechanize.
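As a hedged sketch of what this looks like with Requests (the URL and login fields below are placeholder assumptions, not a real site's API): a browser-like User-Agent header is sent with every request, and a Session stores any cookies the server sets, such as after a login, and replays them on later requests.

# A minimal sketch with Requests: send a browser-like User-Agent and
# keep cookies across requests with a Session.
# The URL and the login fields are placeholders, not a real site's API.
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})

# Simulate a login; cookies set by the server are stored on the session
# and sent automatically with every later request.
session.post('http://example.com/login', data={'user': 'me', 'password': 'secret'})
page = session.get('http://example.com/protected')
print(page.status_code)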
Processing pages after crawling
Crawled web pages usually need further processing, such as filtering out html tags and extracting the text. Python's BeautifulSoup provides concise document-processing functions that can handle most of this work in very little code.
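For example, a minimal sketch of both jobs with BeautifulSoup (the HTML fragment is made up for illustration):

# A minimal sketch: filter out the tags and extract the text with
# BeautifulSoup (pip install beautifulsoup4). The fragment is made up.
from bs4 import BeautifulSoup

html = '<div><h1>Tutorial</h1><p>Life is short, <a href="/wiki/py">use Python</a>.</p></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.get_text())                          # all the text, tags removed
print([a['href'] for a in soup.find_all('a')])  # every link target: ['/wiki/py']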
In fact, many languages and tools can do all of the above, but Python does it fastest and most cleanly. Life is short, you need Python.
That last line, "Life is short, you need Python", made me immediately buy a Python book on Dangdang! I had admired Python before and had always meant to learn it, but kept finding excuses not to.
Python is very powerful on Linux, and the language itself is quite simple.
Expanding on the topic:
An example of a crawler written in Python (this is Python 2 code, using urllib); it downloads each page of Liao Xuefeng's Python tutorial and saves it as a local html file:
# coding: utf-8
import urllib

domain = 'http://www.liaoxuefeng.com'  # Liao Xuefeng's domain name
path = r'C:\Users\cyhhao2013\Desktop\temp' + '\\'  # directory to save the html files

# Read an html header file prepared in advance
input = open(r'C:\Users\cyhhao2013\Desktop\0.html', 'r')
head = input.read()
input.close()

# Open the main page of the Python tutorial
f = urllib.urlopen("http://www.URL to be crawled.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000")
home = f.read()
f.close()

# Remove all newlines and spaces (this makes the urls easier to extract)
geturl = home.replace("\n", "")
geturl = geturl.replace(" ", "")

# Split out the string fragments that contain the urls
pages = geturl.split(r'em;"><a href="')[1:]

# Insert the first page as well, so the set is complete
pages.insert(0, '/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000">')

# Traverse the url list
for li in pages:
    url = li.split(r'">')[0]
    url = domain + url  # concatenate the full url
    print url
    f = urllib.urlopen(url)
    html = f.read()
    f.close()
    # Get the title, to use as the file name
    title = html.split("<title>")[1]
    title = title.split(" - Liao Xuefeng's official website</title>")[0]
    # Decode to unicode first; otherwise joining the title into the path breaks
    title = title.decode('utf-8').replace("/", " ")
    # Cut out just the body text
    html = html.split(r'<!-- block main -->')[1]
    html = html.split(r'<h4>Your support is the biggest motivation for the author to write!</h4>')[0]
    html = html.replace(r'src="', 'src="' + domain)
    # Add the head and tail to form a complete html document
    html = head + html + "</body></html>"
    # Write the output file
    output = open(path + "%d" % pages.index(li) + title + '.html', 'w')
    output.write(html)
    output.close()
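Note that this example extracts titles and links by splitting raw strings, which breaks as soon as the page markup changes even slightly; the BeautifulSoup approach sketched earlier is the more robust way to do the same extraction.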
This concludes this article on why Python is suitable for writing crawlers. For more on the topic, search ZaLou.Cn's earlier articles or browse the related articles below, and we hope you will continue to support ZaLou.Cn!