A Douban short-comment crawler for Yao
Extract the data by parsing the web pages with regular expressions
Draw a word cloud of the comments with wordcloud
# data collection
import requests
import re
import csv
import jieba
import wordcloud
# Crawl multiple pages in a loop
# Observe the pattern in the page URLs
# https://movie.douban.com/subject/26754233/comments?start=0&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=20&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=40&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=60&limit=20&sort=new_score&status=P
# Each page holds 20 comments, with start increasing by 20 from 0, so step the loop by 20
# Note: I originally set the loop to crawl 1000 pages, but that was a big overestimate; there aren't nearly that many, so it's cut down here
page = []
for i in range(0, 80, 20):
    page.append(i)

with open(r'D:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\Douban Yao Reptile\Short comment.csv', 'a', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for i in page:
        url = 'https://movie.douban.com/subject/26754233/comments?start=' + str(i) + '&limit=20&sort=new_score&status=P'
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
        resp = requests.get(url, headers=headers)
        html = resp.text
        # Parse the page: each short comment sits inside <span class="short">...</span>
        res = re.compile('<span class="short">(.*?)</span>')
        duanpin = re.findall(res, html)
        # Save the data: one comment per CSV row
        for duan in duanpin:
            writer.writerow([duan])
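Two details the loop above glosses over: Douban throttles or blocks rapid anonymous requests, and (.*?) captures the raw HTML, so entities like &quot; land in the CSV unescaped. Below is a minimal hardened sketch; the status-code check, the html.unescape call, and the 2-second pause are my own additions, not part of the original code:

import time
import html
import requests, re, csv

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
with open('Short comment.csv', 'a', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for start in range(0, 80, 20):   # range() already yields the offsets, so no page list is needed
        url = ('https://movie.douban.com/subject/26754233/comments?start='
               + str(start) + '&limit=20&sort=new_score&status=P')
        resp = requests.get(url, headers=headers)
        if resp.status_code != 200:  # stop when Douban answers with a non-200 code (blocked or throttled)
            break
        for duan in re.findall('<span class="short">(.*?)</span>', resp.text):
            writer.writerow([html.unescape(duan)])  # turn &quot; and friends back into real characters
        time.sleep(2)                # polite pause between pages; the 2 s value is a guess, not from the original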
# Draw a word cloud of the short comments
f = open(r'D:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\Douban Yao Reptile\Short comment.csv', encoding='utf-8')
txt = f.read()
f.close()
# Tokenize the Chinese text with jieba, then join with spaces so WordCloud can split it
txt_list = jieba.lcut(txt)
string = ' '.join(txt_list)
w = wordcloud.WordCloud(
    width=1000,
    height=700,
    background_color='white',
    font_path='msyh.ttc',    # a Chinese font is required, otherwise the characters render as boxes
    scale=15,
    stopwords={' '},
    contour_width=5,         # contour options only take effect when a mask image is passed
    contour_color='red')
w.generate(string)
w.to_file(r'D:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\Douban Yao Reptile\Yao.png')
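The stopwords={' '} set above only removes the literal space token, while jieba also emits punctuation and one-character function words that tend to dominate the picture. A small filtering sketch, reusing txt and the WordCloud object w from above; the length cutoff and the example stopword set are my own assumptions:

# Drop one-character tokens (punctuation, particles) and a hand-picked stopword
# set before joining, so the cloud is built only from meaningful words.
stop = {'电影', '一个', '没有'}   # example stopwords, extend to taste
tokens = [t for t in jieba.lcut(txt) if len(t) > 1 and t not in stop]
w.generate(' '.join(tokens))
w.to_file('Yao_filtered.png')     # hypothetical output name, to keep the original Yao.png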
The crawl returned far fewer short comments than expected; the page source really only contains this many, which puzzles me. Something may be off. Perhaps the page needs to be requested as the mobile version instead, or it may be that Douban only exposes a limited number of comment pages to visitors who aren't logged in. Or maybe there really are only this few, who knows.
Judging from the word cloud, Yao is still marketing itself under the banner of history. So don't watch historical-nihilism movies like this one, because Guanhu's stance simply isn't straight.
I've written a lot about crawlers and casual Python recently, so I'll turn to data analysis next.
love&peace