Python crawler-beautifulsoup use

python crawling weather#

Overview##

For the simple use of beautifulsoup, beautifulsoup is a third-party library used by beginners in crawlers, with simple operation and friendly code.
Include the code in the function, and implement repeated crawling by calling the function

Code##

import requests
from bs4 import BeautifulSoup
# pandas library, used to save data, and this is also the basic library
import pandas as pd
# retrieve data

# Get webpage source code
def get_data(url):
 resp=requests.get(url)
 # utf-8 not supported
 html=resp.content.decode('gbk')
 # Parse the original html file
 # html.parser is a built-in parser, which may be slower in analysis
 soup=BeautifulSoup(html,'html.parser')
 # By find_The all function finds all tr tags
 tr_list=soup.find_all('tr')
 # Name three lists for receiving data
 dates,conditions,temp=[],[],[]for data in tr_list[1:]:
  sub_data=data.text.split()
  dates.append(sub_data[0])
  conditions.append(''.join(sub_data[1:3]))
  temp.append(''.join(sub_data[3:6]))
 # Create an empty data frame for storing data
 _ data=pd.DataFrame()
 _ data['date']=dates
 _ data['the weather']=conditions
 _ data['temperature']=temp
 # Return data
 return _data

data1=get_data('http://www.tianqihoubao.com/lishi/beijing/month/201101.html')
data2=get_data('http://www.tianqihoubao.com/lishi/beijing/month/201102.html')
data3=get_data('http://www.tianqihoubao.com/lishi/beijing/month/201103.html')
# Connect the three data frames through concat and reset the index
df=pd.concat([data1,data2,data3]).reset_index(drop=True)

# Data preprocessing
# Pass the temperature/Sort
df['Maximum temperature']=df['temperature'].str.split('/',expand=True)[0]
df['lowest temperature']=df['temperature'].str.split('/',expand=True)[1]
# Use the map function to replace the ℃ in the temperature and convert it to a number to facilitate subsequent analysis
df['Maximum temperature']=df['Maximum temperature'].map(lambda x:int(x.replace('℃','')))
df['lowest temperature']=df['lowest temperature'].map(lambda x:int(x.replace('℃','')))

# save
df.to_csv('./python/Crawling weather data/beijing.csv',index=False,encoding='utf-8')

# Read when used
pd.read_csv('./python/Crawling weather data/beijing.csv')

Conclusion#

All projects about crawlers are practical projects. There is no theory. The idea is that the basic theory is easy to expire. It feels a bit laborious to eat textbooks. Many projects have changed. And some crawlers are based on python2, so this method may be the best. Way out.

Recommended Posts

Python crawler-beautifulsoup use
Python3 external module use
What databases can python use
How to use python tuples
Use of Pandas in Python development
How to use python thread pool
Use python to query Oracle database
Use nohup command instructions in python
Use C++ to write Python3 extensions
Use python to achieve stepwise regression
Use of numpy in Python development
What is the use of Python
How to use SQLite in Python
Python Xiaobai should not use expressions indiscriminately
Python multithreading
Python CookBook
Python FAQ
Python3 dictionary
How to use and and or in Python
python (you-get)
Python string
Use Python to make airplane war games
Python basics
Python descriptor
Python basics 2
Python exec
Python notes
Python3 tuple
CentOS + Python3.6+
Python advanced (1)
Python IO
Python multithreading
Python toolchain
Python | So collections are so easy to use! !
Python3 list
Python multitasking-coroutine
Python overview
Python class decorator, use a small demo
Use Python to generate Douyin character videos!
Python analytic
Python basics
07. Python3 functions
Python basics 3
Python novice learns to use the library
Python multitasking-threads
Use Python to quickly cut out images
Python functions
Use python3 to install third on ubuntu
python sys.stdout
python operator
Python entry-3
Centos 7.5 python3.6
Python string
python queue Queue
Python basics 4
Python basics 5
How to use the round function in python
Dry goods | Use Python to operate mysql database
How to use the zip function in Python
How to use the format function in python
How to use code running assistant in python