Scrapy simulation login of Python crawler

scrapy simulation landing##

learning target:

Application request object cookies parameter use
Understand the role of start_requests function
Application constructs and sends post request

1. Review the previous simulation login method###

1.1 How does the requests module implement simulated login?

Request page with cookies directly
Find the url address and send a post request to store the cookie

1.2 How does selenium simulate login?

Find the corresponding input tag, enter the text and click login

1.3 scrapy's simulation landing

Directly carry cookies
Find the url address and send a post request to store the cookie

2. scrapy carries cookies to directly get the page that needs to be logged in

Application scenario

Cookie expiration time is very long, common in some irregular websites
Able to get all the data before the cookie expires
Cooperate with other programs, such as using selenium to get the cookie after login and save it locally, and read the local cookie before scrapy sends the request

2.1 Implementation: Refactor scrapy's starte_rquests method

The start_url in scrapy is processed through start_requests, and the implementation code is as follows

# This is the source code
def start_requests(self):
 cls = self.__class__
 ifmethod_is_overridden(cls, Spider,'make_requests_from_url'):
  warnings.warn("Spider.make_requests_from_url method is deprecated; it ""won't be called in future Scrapy releases. Please ""override Spider.start_requests method instead (see %s.%s)."%(
    cls.__module__, cls.__name__
   ),) for url in self.start_urls:yield self.make_requests_from_url(url)else:for url in self.start_urls:yieldRequest(url, dont_filter=True)

So correspondingly, if the url in the start_url address is a url address that needs to be accessed after logging in, you need to rewrite the start_request method and manually add a cookie in it

2.2 Log in to github with cookies

Test account noobpythoner zhoudawei123

import scrapy
import re

classLogin1Spider(scrapy.Spider):
 name ='login1'
 allowed_domains =['github.com']
 start_urls =['https://github.com/NoobPythoner'] #This is a page that can be accessed after logging in

 def start_requests(self): #Refactor start_requests method
  # This cookies_str is obtained by capture
  cookies_str ='...' #Capture
  # Cookies_str is converted to cookies_dict
  cookies_dict ={i.split('=')[0]:i.split('=')[1]for i in cookies_str.split('; ')}yield scrapy.Request(
   self.start_urls[0],
   callback=self.parse,
   cookies=cookies_dict
        )

 def parse(self, response): #Use regular expressions to match the username to verify whether the login is successful
  # The regular match is the username of github
  result_list = re.findall(r'noobpythoner|NoobPythoner', response.body.decode())print(result_list)
  pass

Note:

Cookies in scrapy cannot be placed in headers, there are special cookies parameters when constructing the request, which can accept cookies in dictionary form
Set ROBOTS protocol, USER_AGENT in setting

3. scrapy.Request sends post request###

We know that you can send post requests by specifying method and body parameters through scrapy.Request(); but usually scrapy.FormRequest() is used to send post requests

3.1 Send post request

Note: scrapy.FormRequest() can send forms and ajax requests, please refer to https://www.jb51.net/article/146769.htm

3.1.1 Thinking analysis

Find the url address of the post: click the login button to capture the packet, and then locate the url address as https://github.com/session
Find the law of the request body: analyze the request body of the post request, and the parameters contained in it are in the previous response
Whether the login is successful: Observe whether the user name is included by requesting the personal homepage

3.1.2 The code is implemented as follows:

import scrapy
import re

classLogin2Spider(scrapy.Spider):
 name ='login2'
 allowed_domains =['github.com']
 start_urls =['https://github.com/login']

 def parse(self, response):
  authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
  utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
  commit = response.xpath("//input[@name='commit']/@value").extract_first()
        
  # Construct a POST request and pass it to the engine
  yield scrapy.FormRequest("https://github.com/session",
   formdata={"authenticity_token":authenticity_token,"utf8":utf8,"commit":commit,"login":"noobpythoner","password":"***"},
   callback=self.parse_login
       )

 def parse_login(self,response):
  ret = re.findall(r"noobpythoner|NoobPythoner",response.text)print(ret)

Tips#####

By setting COOKIES_DEBUG=TRUE in settings.py, you can see the cookie delivery process in the terminal

summary##

The URL address in start_urls is handed over to start_request for processing. If necessary, the start_request function can be rewritten
Log in with cookies directly: cookies can only be passed to cookies parameter reception
scrapy.Request() send post request