Previous articles in this series: "One article takes you to understand Python crawlers (1): introduction to basic principles" and "One article takes you to understand Python crawlers (2): introduction to four common basic crawler methods"
Selenium crawlers are called visual crawlers
mainly in contrast to the page-parsing crawling methods covered in the earlier articles.
A Selenium crawler works by simulating a person's clicks and keystrokes,
and you can watch Selenium drive the browser and operate it as it runs.
It is much like watching someone else operate your computer remotely.
Selenium also has a mode with no visible interface (headless mode).
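As a rough illustration of the mode without an interface, the sketch below starts Chrome headless. The helper names `headless_chrome_arguments` and `make_headless_chrome` and the window size are assumptions made for this example; `--headless` and `--window-size` are standard Chrome command-line switches. Selenium is imported inside the function so the helper that only builds the argument list can be used without a browser installed.

```python
def headless_chrome_arguments(width=1920, height=1080):
    """Return Chrome command-line switches for running without a window."""
    return ["--headless", f"--window-size={width},{height}"]


def make_headless_chrome():
    """Start Chrome in headless mode.

    Needs the selenium package and a ChromeDriver that matches your Chrome.
    """
    from selenium import webdriver  # imported lazily, only when a browser is started

    options = webdriver.ChromeOptions()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Calling `make_headless_chrome()` should return a driver object that behaves like a normal one, except that no browser window appears on screen.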
Basic introduction to selenium:
Selenium is a complete web application testing system.
It covers test recording (Selenium IDE), writing and running tests (Selenium Remote Control),
and running tests in parallel (Selenium Grid).
Selenium Core is based on JsUnit
and is written entirely in JavaScript, so it can run in any browser that supports JavaScript.
As an automated testing tool, Selenium drives real browsers and supports several of them.
In crawlers it is mainly used to solve JavaScript rendering problems.
When writing crawlers in Python, I mainly use Selenium's WebDriver.
# Install the selenium library
pip install selenium
# Install the browser driver that matches your browser
# First, we can check which browsers Selenium's WebDriver supports as follows
from selenium import webdriver
# help() prints the documentation itself, so no print() is needed
help(webdriver)
Supported browsers:
PACKAGE CONTENTS
android (package)
blackberry (package)
chrome (package)
common (package)
edge (package)
firefox (package)
ie (package)
opera (package)
phantomjs (package)
remote (package)
safari (package)
support (package)
webkitgtk (package)
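# A few words about PhantomJS, one of the more important entries.
# PhantomJS is a scriptable headless browser based on WebKit with a JavaScript API.
# It renders web pages without a visible browser window,
# is fast, and natively supports various web standards: DOM handling, CSS selectors, JSON and more.
# PhantomJS can be used for page automation, network monitoring, web page screenshots, and headless testing.
# Note: PhantomJS is no longer maintained and recent Selenium releases have dropped its driver; headless Chrome or Firefox is the usual replacement.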
Google Chrome driver (ChromeDriver) download address
The driver version must match your browser: enter chrome://version/ in the Chrome address bar to see your Chrome version.
I use Anaconda, so I download the driver and drop it into the anaconda3\Scripts folder.
If you use another IDE such as PyCharm or VS Code but it runs Anaconda's Python interpreter, the same location works.
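Dropping the driver into the Scripts folder presumably works because that folder is on the PATH of the Anaconda environment. A minimal way to check this from Python, using only the standard library, might look like the sketch below; the function name `find_driver` is made up for this example.

```python
import shutil


def find_driver(executable="chromedriver"):
    """Return the full path of the driver if it is found on PATH, else None."""
    return shutil.which(executable)
```

If `find_driver()` returns `None`, the driver is not on the PATH and `webdriver.Chrome()` is unlikely to find it without being given an explicit location.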
Simple test
from selenium import webdriver
# Declare the browser objects (each call opens a browser window and needs its own driver)
browser1 = webdriver.Chrome()
browser2 = webdriver.Firefox()
# Visit the page
browser1.get("http://www.baidu.com")
# Print the page source
print(browser1.page_source)
# Close the current windows
browser1.close()
browser2.close()
To operate on a page, the first step is to select a page element.
The eight common element locating methods are as follows:
Locate one element | Locate multiple elements | Description |
---|---|---|
find_element_by_id | find_elements_by_id | Locate by element id |
find_element_by_name | find_elements_by_name | Locate by element name |
find_element_by_xpath | find_elements_by_xpath | Locate by xpath path |
find_element_by_link_text | find_elements_by_link_text | Locate by complete hyperlink text |
find_element_by_partial_link_text | find_elements_by_partial_link_text | Locate through partial hyperlink text |
find_element_by_tag_name | find_elements_by_tag_name | Locate by tag name |
find_element_by_class_name | find_elements_by_class_name | Locate by class name |
find_element_by_css_selector | find_elements_by_css_selector | Locate by css selector |
For more detailed locating methods, please refer to: "The most complete in history! 30 ways to locate Selenium elements"
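One caveat about the table above: in Selenium 4 the `find_element_by_*` helpers were deprecated and, from 4.3 on, removed in favor of `find_element(by, value)` and `find_elements(by, value)`. The sketch below maps each helper suffix from the table to the locator strategy string that the `By` class defines (for example `By.ID` is the string `"id"`). The names `LEGACY_TO_BY`, `find_one` and `find_all` are hypothetical helpers for this illustration, not part of Selenium.

```python
# Suffix of the old helper -> locator strategy string.
# These strings are the values of the constants on
# selenium.webdriver.common.by.By (By.ID == "id", By.CSS_SELECTOR == "css selector", ...).
LEGACY_TO_BY = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def find_one(driver, legacy_suffix, value):
    """Equivalent of the old driver.find_element_by_<legacy_suffix>(value)."""
    return driver.find_element(LEGACY_TO_BY[legacy_suffix], value)


def find_all(driver, legacy_suffix, value):
    """Equivalent of the old driver.find_elements_by_<legacy_suffix>(value)."""
    return driver.find_elements(LEGACY_TO_BY[legacy_suffix], value)
```

With a real driver, `find_one(drive, "id", "LoginForm_password")` would do the same as `drive.find_element(By.ID, "LoginForm_password")`. In everyday code it is simpler to import `By` and call `find_element` directly.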
# Here drive is the WebDriver object; username and password are strings defined elsewhere
# Find the username field and type the username
user = drive.find_element_by_name("LoginForm[username]")
user.send_keys(username)
# Find the password field and type the password
pwd = drive.find_element_by_id("LoginForm_password")
pwd.send_keys(password)
# Click the login button to log in
drive.find_element_by_class_name("login_btn").click()
Simply put, a handle is the unique identifier of each tab (window) shown at the top of the browser
# Get the handles of all open windows
handles = drive.window_handles
# Switch to the second tab through its handle (indexes start at 0)
drive.switch_to.window(handles[1])
# ... operate on the second tab ...
# Close the current window
drive.close()
# Switch back to the first tab through its handle
drive.switch_to.window(handles[0])
# Pause for a random 2 to 3 seconds (requires: import time, random)
time.sleep(random.uniform(2,3))
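The switch-there-and-back pattern above can be wrapped in a small context manager so that the script always returns to the tab it started from, even if the work in the other tab raises an exception. This is only a sketch: `in_tab` is a made-up name, and it relies on the driver attributes `window_handles`, `current_window_handle` and `switch_to.window`.

```python
from contextlib import contextmanager


@contextmanager
def in_tab(driver, index):
    """Switch to the tab at position `index`, then switch back afterwards."""
    original = driver.current_window_handle
    driver.switch_to.window(driver.window_handles[index])
    try:
        yield driver
    finally:
        # Runs even when the body raises, so the script never stays in the wrong tab
        driver.switch_to.window(original)
```

Usage would look like `with in_tab(drive, 1): ...`, with the operations on the second tab inside the `with` block.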
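# Load a url
drive.get(url)
# Get the url of the current page, for example to assert on it
currentPageUrl = drive.current_url
# Open a page first: a cookie can only be added for the domain that is currently loaded
drive.get("http://www.baidu.com")
# Add a cookie
cookie ={'name':'foo','value':'bar'}
drive.add_cookie(cookie)
# Get all cookies visible to the current page as a list of dicts
drive.get_cookies()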
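`get_cookies()` returns a list of dicts, each with at least a `name` and a `value` key. If the cookies are to be reused outside the browser, for instance in a plain HTTP client (an assumption about how they might be used, not something the original does), they can be flattened as in the sketch below. Both function names are hypothetical.

```python
def cookies_to_dict(selenium_cookies):
    """Turn the list returned by get_cookies() into a {name: value} dict."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def cookies_to_header(selenium_cookies):
    """Build the value of a Cookie request header from the same list."""
    pairs = cookies_to_dict(selenium_cookies).items()
    return "; ".join(f"{name}={value}" for name, value in pairs)
```

For example, `cookies_to_dict(drive.get_cookies())` would give a dict that most HTTP libraries accept as a cookies argument.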
Many websites now use Ajax,
so it is hard to know when the page elements have finished loading,
which makes selecting page elements more difficult.
In this case you need to set a wait (wait for the page to load).
Selenium offers two ways to wait, explicit and implicit; a plain thread sleep is a third option:
1. Explicit wait
An explicit wait is triggered by a condition:
the script does not continue until that condition is met.
You can set a timeout; if the condition is still not met when the timeout expires, an exception (TimeoutException) is thrown.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
drive = webdriver.Chrome()
url ='http://www.baidu.com/'
drive.get(url)
try:
    # Explicit wait: the locator is passed as a single (By, value) tuple
    WebDriverWait(drive,10).until(EC.presence_of_element_located((By.ID,"LoginForm[username]")))
except TimeoutException:
    print('%s: element not found on page'% url)
The code above loads the 'http://www.baidu.com/' page
and waits for the element with id "LoginForm[username]".
The timeout is set to 10 seconds; by default WebDriverWait checks for the element every 500 ms until it appears or the timeout expires.
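To make the polling behaviour concrete, here is a simplified sketch of the idea in plain Python. It is not Selenium's implementation: the real WebDriverWait passes the driver to the condition and by default ignores NoSuchElementException while polling. The name `wait_until` and its parameters are chosen for this example only.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Wait a little before checking again (500 ms by default, as in the text above)
        time.sleep(poll_interval)
```

The key point is that the wait ends as soon as the condition holds, so a 10 second timeout costs 10 seconds only in the failure case.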
Selenium provides a number of built-in conditions for explicit waits
in the expected_conditions module; see the table below for details.
Built-in condition | Purpose |
---|---|
title_is | Check whether the title of the current page equals the expected text |
title_contains | Check whether the title of the current page contains the expected string |
presence_of_element_located | Check whether an element is present in the DOM tree; it need not be visible |
presence_of_all_elements_located | Check whether at least one element matching the locator is present in the DOM tree |
visibility_of_element_located | Check whether the element found by a locator is visible |
visibility_of | Check whether an already located element is visible |
invisibility_of_element_located | Check whether an element is absent from the DOM tree or not visible |
text_to_be_present_in_element | Check whether the element's text contains the expected string |
text_to_be_present_in_element_value | Check whether the element's value attribute contains the expected string |
frame_to_be_available_and_switch_to_it | Check whether a frame can be switched into; if so, switch into it and return True, otherwise return False |
element_to_be_clickable | Check whether an element is visible and enabled, so that it can be clicked |
staleness_of | Wait until an element is removed from the DOM tree |
element_to_be_selected | Check whether an already located element is selected; generally used for drop-down lists |
element_located_to_be_selected | Check whether the element found by a locator is selected; generally used for drop-down lists |
element_selection_state_to_be | Check whether the selected state of an already located element matches the expected state |
element_located_selection_state_to_be | Check whether the selected state of the element found by a locator matches the expected state |
alert_is_present | Check whether an alert box is present on the page |
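Every condition in the table is a callable that receives the driver and returns either a truthy result or False, and `WebDriverWait.until` simply keeps calling it. That means a condition missing from the table can be written by hand. The class below is a hypothetical example (its name and behaviour are not part of Selenium) that waits until a minimum number of matching elements exist.

```python
class element_count_at_least:
    """Custom wait condition: satisfied once `count` or more elements match."""

    def __init__(self, locator, count):
        self.locator = locator  # a (by, value) tuple, e.g. ("css selector", "li.item")
        self.count = count

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        # Return the elements when there are enough of them, otherwise False so the wait continues
        return elements if len(elements) >= self.count else False
```

It would be used like the built-in ones, for example `WebDriverWait(drive, 10).until(element_count_at_least((By.CSS_SELECTOR, "li.item"), 5))`, where the selector is only a placeholder.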
2. Implicit wait
With an implicit wait, when Selenium tries to locate an element and does not find it immediately, it keeps retrying for up to a set length of time.
It is similar to a socket timeout: the default is 0 seconds (no waiting), and the value you set is the maximum waiting time.
The visible effect in the browser is:
the script waits while the page is still loading (until the × button next to the address bar is replaced by the reload icon) and then continues;
if the element still cannot be found when the waiting time is used up, an error (NoSuchElementException) is raised.
Usage
from selenium import webdriver
drive = webdriver.Chrome()
url ='http://www.baidu.com/'
# Set the maximum waiting time to 10 seconds
drive.implicitly_wait(10)
drive.get(url)
user = drive.find_element_by_name("LoginForm[username]")
3. Thread sleep
time.sleep(seconds) is the most commonly used way to pause the thread.
To reduce the risk of being flagged as a crawler, I personally prefer a random sleep:
# requires: import time, random
time.sleep(random.uniform(4,5))
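If the random sleep is used in many places, it can be wrapped in a small helper that also reports how long it actually slept, which is handy for logging. The name `random_pause` and its default bounds (taken from the 4 to 5 second example above) are assumptions for this sketch.

```python
import random
import time


def random_pause(low=4.0, high=5.0):
    """Sleep for a random duration between low and high seconds and return it."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```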
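# Create the Chrome options object (requires: from selenium import webdriver)
chrome_options = webdriver.ChromeOptions()
# Add a browser extension (extension_path is the path to a .crx file)
chrome_options.add_extension(extension_path)
# Set the download path
# download.default_directory: the download directory; profile.default_content_settings.popups: set to 0 to disable pop-up windows
prefs ={'profile.default_content_settings.popups':0,'download.default_directory':tmp_path}
chrome_options.add_experimental_option('prefs', prefs)
# Pass the options when starting the browser
drive = webdriver.Chrome(options=chrome_options)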