Selenium, the visual crawler, for Python crawlers

Previous article review#

One article takes you to understand Python crawlers (1): introduction to basic principles; One article takes you to understand Python crawlers (2): introduction to four common basic crawler methods

Why the Selenium crawler is called a visual crawler

Compared with the web-page-parsing crawling methods mentioned above, a Selenium crawler mainly simulates a human's click operations. You can watch the whole process of Selenium driving and operating the browser; it feels like watching someone else operate your computer for you, much like a remote-desktop session.

Of course, Selenium also has a headless (no-interface) mode.
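Headless mode can be sketched as follows. This is a minimal, hedged example, assuming chromedriver is installed and on your PATH; the URL is just an illustration:

```python
# Minimal headless-mode sketch (assumes chromedriver is on PATH).
def fetch_title(url):
    from selenium import webdriver  # imported here so the sketch is self-contained
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.title  # title of the loaded page
    finally:
        driver.quit()  # always release the browser process

if __name__ == "__main__":
    print(fetch_title("http://www.baidu.com"))
```

With `--headless` set, no browser window appears, but page loading, rendering and element lookup all work the same as in the visible mode.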

Quick start#

A basic introduction to Selenium:

Selenium is a complete web application testing system. It includes test recording (Selenium IDE), test writing and running (Selenium Remote Control), and parallel test execution (Selenium Grid).

Selenium Core is based on JsUnit and is written entirely in JavaScript, so it can run on any browser that supports JavaScript. Selenium can simulate a real browser as an automated testingting tool and supports multiple browsers.

For crawlers, it is mainly used to solve JavaScript rendering problems. When writing crawlers in Python, I mainly use Selenium's WebDriver.

# Install the selenium library
pip install selenium
# Install the corresponding browser driver
# We can first check which browsers Selenium's WebDriver supports:
from selenium import webdriver
help(webdriver)
Supported browsers (PACKAGE CONTENTS section of the help output):
    android (package), blackberry (package), chrome (package), common (package),
    edge (package), firefox (package), ie (package), opera (package),
    phantomjs (package), remote (package), safari (package), support (package),
    webkitgtk (package)
# A note on the frequently mentioned PhantomJS:
# PhantomJS is a WebKit-based, server-side JavaScript API.
# It supports the web without needing a visible browser,
# is fast, and natively supports various web standards: DOM handling, CSS selectors, JSON and more.
# PhantomJS can be used for page automation, network monitoring, web page screenshots, and headless testing.
# (Note: the PhantomJS project is no longer maintained, and recent Selenium
# versions have dropped support for it; headless Chrome/Firefox is the usual replacement.)

Google Chrome driver download address
Pay attention to the matching version number: enter chrome://version/ in Chrome's address bar to check your Chrome version.
I use Anaconda, so I download the driver and drop it into the anaconda3\Scripts folder.
If you use another IDE such as PyCharm or VSCode, but it loads Anaconda's integrated Python, you can still do the same.

Simple test

from selenium import webdriver
# Declare browser objects
browser1 = webdriver.Chrome()
browser2 = webdriver.Firefox()
# Visit a page
browser1.get("http://www.baidu.com")
print(browser1.page_source)
# Close the current window
browser1.close()

Element positioning#

To operate on a page, the first thing to do is to select (locate) the page elements.
The eight common element-locating methods are as follows:

| Locate one element | Locate multiple elements | Locating strategy |
| --- | --- | --- |
| find_element_by_id | find_elements_by_id | by element id |
| find_element_by_name | find_elements_by_name | by element name |
| find_element_by_xpath | find_elements_by_xpath | by XPath expression |
| find_element_by_link_text | find_elements_by_link_text | by complete hyperlink text |
| find_element_by_partial_link_text | find_elements_by_partial_link_text | by partial hyperlink text |
| find_element_by_tag_name | find_elements_by_tag_name | by tag name |
| find_element_by_class_name | find_elements_by_class_name | by class name |
| find_element_by_css_selector | find_elements_by_css_selector | by CSS selector |

For more detailed locating methods, please refer to: "The most complete in history! 30 ways to locate Selenium elements"
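As a hedged sketch, a few of the locating strategies in the table can be exercised like this. The locators below (Baidu's search box) are illustrative, `driver` is assumed to be an already-created WebDriver, and note that Selenium 4 replaces the `find_element_by_*` helpers with `find_element(By.…, …)`:

```python
# Illustrative locator sketch; the ids/names/selectors below target Baidu's
# search box and are for demonstration only.
def collect_elements(driver):
    from selenium.webdriver.common.by import By  # Selenium 4 style locators
    return [
        driver.find_element(By.ID, "kw"),                    # by element id
        driver.find_element(By.NAME, "wd"),                  # by element name
        driver.find_element(By.XPATH, "//input[@id='kw']"),  # by XPath
        driver.find_element(By.CSS_SELECTOR, "#kw"),         # by CSS selector
    ]

if __name__ == "__main__":
    from selenium import webdriver
    driver = webdriver.Chrome()  # assumes chromedriver is on PATH
    driver.get("http://www.baidu.com")
    # The same element located four different ways
    print(len(collect_elements(driver)))
    driver.quit()
```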

Page operation#

  1. Form filling
# `drive` is an already-created WebDriver instance; username/password are predefined strings
# Find the username field and enter the username
user = drive.find_element_by_name("LoginForm[username]")
user.send_keys(username)
# Find the password field and enter the password
pwd = drive.find_element_by_id("LoginForm_password")
pwd.send_keys(password)
# Click the login button to log in
drive.find_element_by_class_name("login_btn").click()
  2. Window handles

Simply put, a handle is the unique identifier of each tab in the browser's tab bar.

import time
import random

# Get the handles of all current windows
handles = drive.window_handles
# Switch to the second tab through its handle (handles are 0-indexed)
drive.switch_to.window(handles[1])
"""Operations on that tab go here"""
# Close the current window
drive.close()
# Switch back to the first tab through its handle
drive.switch_to.window(handles[0])
time.sleep(random.uniform(2, 3))
  3. URL loading and retrieval
# url loading
drive.get(url)
# Get the current page url (useful for assertions)
currentPageUrl = drive.current_url
  4. Cookie handling
drive.get("http://www.baidu.com")
cookie = {'name': 'foo', 'value': 'bar'}
drive.add_cookie(cookie)
print(drive.get_cookies())

Waiting method#

Many websites now use Ajax, so you cannot be sure when a page's elements will have finished loading. This makes selecting elements harder, so you need to set up a wait (wait for the page to load).

Selenium has two ways to wait:

1. Explicit wait
An explicit wait is a condition-triggered wait:
execution does not continue until a certain condition is met.
You can set a timeout; if the element has not loaded when the timeout expires, an exception is thrown.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

drive = webdriver.Chrome()
url = 'http://www.baidu.com/'
drive.get(url)
try:
    # Explicit wait: block until the element is present in the DOM
    WebDriverWait(drive, 10).until(
        EC.presence_of_element_located((By.ID, "LoginForm[username]"))
    )
except Exception:
    print('element not found on page')

The code above loads the 'http://www.baidu.com/' page and waits for the element with id "LoginForm[username]" to appear. The timeout is set to 10 seconds; by default, WebDriverWait checks whether the element exists every 500 ms.

Selenium provides some built-in conditions for explicit waits, located in the expected_conditions module; see the table below for details.

| Built-in method | Function |
| --- | --- |
| title_is | whether the current page title equals the expected string |
| title_contains | whether the current page title contains the expected string |
| presence_of_element_located | whether an element has been added to the DOM tree (does not imply it is visible) |
| presence_of_all_elements_located | whether at least one matching element exists in the DOM tree |
| visibility_of_element_located | whether an element is visible |
| visibility_of | whether an element is visible |
| invisibility_of_element_located | whether an element is absent from the DOM tree or invisible |
| text_to_be_present_in_element | whether the element's text contains the expected string |
| text_to_be_present_in_element_value | whether the element's value attribute contains the expected string |
| frame_to_be_available_and_switch_to_it | whether the frame can be switched into; if so, switches in and returns True, otherwise returns False |
| element_to_be_clickable | whether an element is visible and enabled |
| staleness_of | wait until an element is removed from the DOM tree |
| element_to_be_selected | whether an element is selected; generally used for drop-down lists |
| element_located_to_be_selected | whether an element is selected; generally used for drop-down lists |
| element_selection_state_to_be | whether an element's selected state matches the expectation |
| element_located_selection_state_to_be | whether an element's selected state matches the expectation |
| alert_is_present | whether there is an alert box on the page |
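As a hedged sketch of one of these conditions, `title_contains` can be combined with `WebDriverWait` like this (assuming chromedriver is on PATH; the title fragment is arbitrary):

```python
# Sketch: wait until the page title contains a given string.
def wait_for_title(driver, fragment, timeout=10):
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    # Polls every 500 ms (the default) until the condition returns True,
    # or raises TimeoutException after `timeout` seconds
    return WebDriverWait(driver, timeout).until(EC.title_contains(fragment))

if __name__ == "__main__":
    from selenium import webdriver
    driver = webdriver.Chrome()
    driver.get("http://www.baidu.com/")
    print(wait_for_title(driver, "百度"))
    driver.quit()
```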

2. Implicit wait

An implicit wait means that when the driver tries to locate an element and does not find it immediately, it keeps retrying for up to a fixed period of time. It is similar to a socket timeout; the default is 0 seconds, and the value you set is the maximum waiting time.

The intuitive feeling in the browser interface is: execution continues as soon as the page finishes loading (the × in the tab turns back into the refresh icon); if loading exceeds the set waiting time, an error is raised.

Usage:

from selenium import webdriver
drive = webdriver.Chrome()
url ='http://www.baidu.com/'
# Set the maximum waiting time to 10 seconds
drive.implicitly_wait(10)
drive.get(url)
user = drive.find_element_by_name("LoginForm[username]")

3. Thread sleep
time.sleep(seconds) is the most commonly used thread-sleep approach.
To reduce the risk of being detected, I personally prefer a random sleep:
time.sleep(random.uniform(4, 5))
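The random-sleep idea can be wrapped in a small helper; this is just a convenience sketch, and the short bounds in the demo call are for illustration only:

```python
import random
import time

def polite_sleep(lo=4.0, hi=5.0):
    """Sleep for a random duration between lo and hi seconds and return it."""
    duration = random.uniform(lo, hi)
    time.sleep(duration)
    return duration

# Short bounds just for demonstration; use something like 4-5 s between real requests
slept = polite_sleep(0.01, 0.02)
print(0.01 <= slept <= 0.02)
```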

Extension loading#

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# Set up an application extension (path to a packed .crx file)
chrome_options.add_extension(extension_path)
# Add a download path
# download.default_directory: sets the download path
# profile.default_content_settings.popups: set to 0 to disable pop-up windows
prefs = {'profile.default_content_settings.popups': 0,
         'download.default_directory': tmp_path}
chrome_options.add_experimental_option('prefs', prefs)
drive = webdriver.Chrome(options=chrome_options)
