Python crawler: simulated login with Scrapy

scrapy simulated login##

Learning targets:
  1. Use the cookies parameter of the Request object
  2. Understand the role of the start_requests function
  3. Construct and send a POST request

1. Review of previous simulated login methods###

1.1 How does the requests module implement simulated login?

  1. Request the page with cookies carried directly
  2. Find the login url address and send a POST request to it; the session stores the returned cookie (see the sketch after this list)
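
For comparison, here is a minimal sketch of both approaches with the requests module; the urls, cookie name and form field names below are placeholders, not a real site:

import requests

# Approach 1: carry the cookies captured from a logged-in browser session
cookies = {'session_id': '...'}  # placeholder cookie captured after a manual login
response = requests.get('https://example.com/profile', cookies=cookies)

# Approach 2: post the login form; the session object stores the returned cookie
session = requests.session()
session.post('https://example.com/login', data={'username': '...', 'password': '...'})
response = session.get('https://example.com/profile')  # sent with the stored cookie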

1.2 How does selenium implement simulated login?

  1. Find the corresponding input tags, enter the text and click login (see the sketch below)
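
A minimal selenium sketch of that idea, assuming hypothetical element names on a placeholder login page:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com/login')  # placeholder url
driver.find_element(By.NAME, 'login').send_keys('username')     # assumed input name
driver.find_element(By.NAME, 'password').send_keys('password')  # assumed input name
driver.find_element(By.CSS_SELECTOR, "input[type='submit']").click()  # assumed selector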

1.3 scrapy's simulated login

  1. Carry cookies directly
  2. Find the login url address and send a POST request to it to obtain the cookie

2. Carrying cookies in scrapy to request pages that require login###

Application scenarios:
  1. The cookie expiration time is very long, which is common on some poorly maintained websites
  2. All the needed data can be fetched before the cookie expires
  3. Cooperating with other programs: for example, use selenium to get the cookies after login and save them locally, then read the local cookies before scrapy sends its requests (see the sketch after this list)
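
Scenario 3 could be implemented roughly like this; the file name is an assumption and the actual login steps are elided:

import json
from selenium import webdriver

# Step 1 (a separate helper script): log in with selenium, then save the cookies locally
driver = webdriver.Chrome()
driver.get('https://example.com/login')  # placeholder url
# ... perform the login steps here ...
with open('cookies.json', 'w') as f:
    json.dump(driver.get_cookies(), f)  # selenium returns a list of cookie dicts

# Step 2 (inside the spider, before sending requests): read the file and convert it
# to the {name: value} dict form that scrapy.Request's cookies parameter accepts
with open('cookies.json') as f:
    cookies_dict = {c['name']: c['value'] for c in json.load(f)}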

2.1 Implementation: override scrapy's start_requests method

The urls in scrapy's start_urls are processed through start_requests, whose source code is as follows:

# This is the Scrapy source code
def start_requests(self):
    cls = self.__class__
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            yield Request(url, dont_filter=True)

Correspondingly, if the urls in start_urls can only be accessed after logging in, you need to override the start_requests method and manually add cookies in it.

2.2 Log in to github with cookies

Test account: noobpythoner / zhoudawei123

import scrapy
import re

class Login1Spider(scrapy.Spider):
    name = 'login1'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/NoobPythoner']  # this page can only be accessed after logging in

    def start_requests(self):  # override the start_requests method
        # this cookies_str is obtained by packet capture from a logged-in browser
        cookies_str = '...'  # captured cookie string
        # convert cookies_str to cookies_dict; split on the first '=' only,
        # because cookie values may themselves contain '='
        cookies_dict = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cookies_str.split('; ')}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies_dict
        )

    def parse(self, response):
        # match the github username with a regular expression to verify that the login succeeded
        result_list = re.findall(r'noobpythoner|NoobPythoner', response.body.decode())
        print(result_list)
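
If the captured cookie string is still valid, running the spider with scrapy crawl login1 should print a non-empty result list.
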
Note:
  1. In scrapy, cookies cannot be placed in headers; the request constructor has a dedicated cookies parameter that accepts cookies in dictionary form
  2. Configure the ROBOTS protocol (ROBOTSTXT_OBEY) and USER_AGENT in settings.py

3. Sending POST requests with scrapy.Request###

POST requests can be sent through scrapy.Request() by specifying the method and body parameters (see the sketch below), but scrapy.FormRequest() is usually used instead.
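
For reference, a sketch of the raw scrapy.Request alternative; the spider, urls and payload below are hypothetical, the kind of thing you might use against a JSON login api:

import json
import scrapy

class ApiLoginSpider(scrapy.Spider):  # hypothetical spider for illustration
    name = 'api_login'
    start_urls = ['https://example.com/api']  # placeholder url

    def parse(self, response):
        # build the POST request by hand: set method and body explicitly
        yield scrapy.Request(
            'https://example.com/api/login',  # placeholder url
            method='POST',
            body=json.dumps({'login': 'user', 'password': 'pass'}),
            headers={'Content-Type': 'application/json'},
            callback=self.parse_login,
        )

    def parse_login(self, response):
        print(response.status)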

3.1 Sending a POST request

Note: scrapy.FormRequest() can send both form and ajax requests; for details see https://www.jb51.net/article/146769.htm

3.1.1 Thinking analysis
  1. Find the POST url address: click the login button while capturing packets, and locate the url address https://github.com/session

  2. Find the pattern of the request body: analyze the request body of the POST request; the parameters it contains can be found in the previous response

  3. Check whether the login succeeded: request the personal homepage and check whether it contains the username

3.1.2 The code is implemented as follows:

import scrapy
import re

class Login2Spider(scrapy.Spider):
    name = 'login2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # the form parameters are hidden in the login page's response
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        commit = response.xpath("//input[@name='commit']/@value").extract_first()

        # construct a POST request and hand it to the engine
        yield scrapy.FormRequest(
            "https://github.com/session",
            formdata={
                "authenticity_token": authenticity_token,
                "utf8": utf8,
                "commit": commit,
                "login": "noobpythoner",
                "password": "***"
            },
            callback=self.parse_login
        )

    def parse_login(self, response):
        # match the username to verify that the login succeeded
        ret = re.findall(r"noobpythoner|NoobPythoner", response.text)
        print(ret)
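
After replacing the *** placeholder with the real password, the spider can be run with scrapy crawl login2; if the printed list contains the username, the POST login succeeded.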
Tips#####

By setting COOKIES_DEBUG = True in settings.py, you can see the cookie delivery process in the terminal.
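
For example, the relevant lines in settings.py might look like this (the user agent string is only an example value):

# settings.py
ROBOTSTXT_OBEY = False          # do not let robots.txt filter out the login url
USER_AGENT = 'Mozilla/5.0 ...'  # example; use a real browser UA string
COOKIES_DEBUG = True            # log the cookies sent and received for every request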


Summary##

  1. The url addresses in start_urls are handed to start_requests for processing; if necessary, the start_requests method can be overridden
  2. To log in with cookies directly: cookies can only be passed through the dedicated cookies parameter
  3. scrapy.Request() can send a POST request when the method and body parameters are specified; scrapy.FormRequest() is the usual shortcut
