Simple use of BeautifulSoup. BeautifulSoup is a third-party library popular with web-scraping beginners: the API is simple and the resulting code is easy to read.
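As a minimal sketch of the API (the HTML snippet below is made up purely for illustration, not taken from the weather site):

from bs4 import BeautifulSoup

# A made-up snippet just to show parsing and find_all
html = '<table><tr><td>date</td></tr><tr><td>2011-01-01</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all('tr'):
    # get_text(strip=True) collapses each row to its visible text
    print(tr.get_text(strip=True))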
The scraping code is wrapped in a function, so crawling several pages is just a matter of calling that function once per page (a loop version is sketched after the script).
import requests
from bs4 import BeautifulSoup
# pandas, used here to save the data; another staple library
import pandas as pd
# Retrieve the data
# Fetch the page source for one month and parse it into a data frame
def get_data(url):
    resp = requests.get(url)
    # utf-8 does not work here; the site serves gbk-encoded pages
    html = resp.content.decode('gbk')
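    # (Sketch, not part of the original: if the encoding were unknown, requests
    # can guess it, e.g. resp.encoding = resp.apparent_encoding; html = resp.text)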
    # Parse the raw html
    # html.parser is the built-in parser; it can be slower than alternatives such as lxml
    soup = BeautifulSoup(html, 'html.parser')
    # Use the find_all function to collect all tr tags (the table rows)
    tr_list = soup.find_all('tr')
    # Three lists to receive the data
    dates, conditions, temp = [], [], []
    # Skip the header row, then pull the text out of each data row
    for data in tr_list[1:]:
        sub_data = data.text.split()
        dates.append(sub_data[0])
        conditions.append(''.join(sub_data[1:3]))
        temp.append(''.join(sub_data[3:6]))
    # Build an empty data frame to hold the results
    _data = pd.DataFrame()
    _data['date'] = dates
    _data['weather'] = conditions
    _data['temperature'] = temp
    # Return the data
    return _data
data1=get_data('http://www.tianqihoubao.com/lishi/beijing/month/201101.html')
data2=get_data('http://www.tianqihoubao.com/lishi/beijing/month/201102.html')
data3=get_data('http://www.tianqihoubao.com/lishi/beijing/month/201103.html')
# Connect the three data frames through concat and reset the index
df=pd.concat([data1,data2,data3]).reset_index(drop=True)
# Data preprocessing
# Split the temperature column on '/' into high and low
temp_split = df['temperature'].str.split('/', expand=True)
df['max temperature'] = temp_split[0]
df['min temperature'] = temp_split[1]
# Use map to strip the ℃ sign and convert each value to an int, to ease later analysis
df['max temperature'] = df['max temperature'].map(lambda x: int(x.replace('℃', '')))
df['min temperature'] = df['min temperature'].map(lambda x: int(x.replace('℃', '')))
# Save to csv
df.to_csv('./python/Crawling weather data/beijing.csv', index=False, encoding='utf-8')
# Read the file back when it is needed
pd.read_csv('./python/Crawling weather data/beijing.csv')
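If more months are wanted, the three explicit calls above can be folded into a loop. A minimal sketch, assuming the site keeps the same yyyymm URL pattern (the month list and the one-second delay are illustrative choices, not from the original):

import time

frames = []
for month in ['201101', '201102', '201103']:
    url = f'http://www.tianqihoubao.com/lishi/beijing/month/{month}.html'
    frames.append(get_data(url))
    time.sleep(1)  # pause between requests so the server is not hammered
df = pd.concat(frames).reset_index(drop=True)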
All the crawler write-ups here are practical projects, with no standalone theory. The reasoning: basic crawler theory goes stale quickly, grinding through textbooks feels laborious, many of their example projects no longer work, and some crawlers are still written for Python 2, so learning by building may be the best way forward.