A Douban short-comment crawler for Yao
Extract the data by parsing the web pages with regular expressions
Draw a word cloud of the comments with wordcloud
# data collection
import requests
import re
import csv
import jieba
import wordcloud
# Crawl multiple pages in a loop
# Observe the pattern in the page URLs
# https://movie.douban.com/subject/26754233/comments?start=0&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=20&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=40&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=60&limit=20&sort=new_score&status=P
# Each page holds 20 comments, with start increasing by 20 from 0, so step the loop by 20
# Note: I originally set the loop to crawl 1000 pages, but that was a big overestimate; there aren't nearly that many, so it's cut down here
page = []
for i in range(0, 80, 20):
    page.append(i)

with open(r'D:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\Douban Yao Reptile\Short comment.csv', 'a', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for i in page:
        url = 'https://movie.douban.com/subject/26754233/comments?start=' + str(i) + '&limit=20&sort=new_score&status=P'
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
        resp = requests.get(url, headers=headers)
        html = resp.text
        # Parse the page: each short comment sits inside <span class="short">...</span>
        res = re.compile('<span class="short">(.*?)</span>')
        duanpin = re.findall(res, html)
        # Save the data: one comment per CSV row
        for duan in duanpin:
            writer.writerow([duan])
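Two details the loop above glosses over: Douban throttles or blocks rapid anonymous requests, and (.*?) captures the raw HTML, so entities like &quot; land in the CSV unescaped. Below is a minimal hardened sketch; the status-code check, the html.unescape call, and the 2-second pause are my own additions, not part of the original code:

import time
import html
import requests, re, csv

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
with open('Short comment.csv', 'a', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for start in range(0, 80, 20):   # range() already yields the offsets, so no page list is needed
        url = ('https://movie.douban.com/subject/26754233/comments?start='
               + str(start) + '&limit=20&sort=new_score&status=P')
        resp = requests.get(url, headers=headers)
        if resp.status_code != 200:  # stop when Douban answers with a non-200 code (blocked or throttled)
            break
        for duan in re.findall('<span class="short">(.*?)</span>', resp.text):
            writer.writerow([html.unescape(duan)])  # turn &quot; and friends back into real characters
        time.sleep(2)                # polite pause between pages; the 2 s value is a guess, not from the original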
# Draw a word cloud of the short comments
f = open(r'D:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\Douban Yao Reptile\Short comment.csv', encoding='utf-8')
txt = f.read()
f.close()
# Tokenize the Chinese text with jieba, then join with spaces so WordCloud can split it
txt_list = jieba.lcut(txt)
string = ' '.join(txt_list)
w = wordcloud.WordCloud(
    width=1000,
    height=700,
    background_color='white',
    font_path='msyh.ttc',    # a Chinese font is required, otherwise the characters render as boxes
    scale=15,
    stopwords={' '},
    contour_width=5,         # contour options only take effect when a mask image is passed
    contour_color='red')
w.generate(string)
w.to_file(r'D:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\Douban Yao Reptile\Yao.png')
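The stopwords={' '} set above only removes the literal space token, while jieba also emits punctuation and one-character function words that tend to dominate the picture. A small filtering sketch, reusing txt and the WordCloud object w from above; the length cutoff and the example stopword set are my own assumptions:

# Drop one-character tokens (punctuation, particles) and a hand-picked stopword
# set before joining, so the cloud is built only from meaningful words.
stop = {'电影', '一个', '没有'}   # example stopwords, extend to taste
tokens = [t for t in jieba.lcut(txt) if len(t) > 1 and t not in stop]
w.generate(' '.join(tokens))
w.to_file('Yao_filtered.png')     # hypothetical output name, to keep the original Yao.png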
The crawl returned far fewer short comments than expected; the page source really only contains this many, which puzzles me. Something may be off. Perhaps the page needs to be requested as the mobile version instead, or it may be that Douban only exposes a limited number of comment pages to visitors who aren't logged in. Or maybe there really are only this few, who knows.
Judging from the word cloud, Yao is still marketing itself under the banner of history. So don't watch historical-nihilism movies like this one, because Guanhu's stance simply isn't straight.
I've written a lot about crawlers and casual Python recently, so I'll turn to data analysis next.
love&peace