Some pages cannot be fetched with requests alone, because their content is generated by JavaScript that runs in the browser. This article covers those pages, in particular the case where a download can only be triggered by a JavaScript call.
Install chrome
wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
yum install ./google-chrome-stable_current_x86_64.rpm
yum install mesa-libOSMesa-devel gnu-free-sans-fonts wqy-zenhei-fonts
Install chromedriver
Taobao source (recommended)
wget http://npm.taobao.org/mirrors/chromedriver/2.41/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
mv chromedriver /usr/bin/
chmod +x /usr/bin/chromedriver
Thanks for this blog
In the steps above, you can download whichever versions suit you. Note: the Chrome and chromedriver versions must match. Each chromedriver release lists the Chrome versions it supports.
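The version check can be sketched in a few lines of Python. This is a minimal sketch, not part of the original steps: the helper names `major_version` and `versions_match` are hypothetical, and the sample version strings in the usage note are illustrative. It assumes a release series where chromedriver shares Chrome's major version number. Older 2.x releases such as the 2.41 used above follow their own numbering, so for those the supported Chrome versions have to be read from the release notes instead.

```python
import re


def major_version(version_output: str) -> int:
    """Return the major version number found in a `--version` output line."""
    match = re.search(r"(\d+)\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output: str, driver_output: str) -> bool:
    """Compare major versions (hypothetical helper).

    Only meaningful when both tools use the same numbering scheme.
    """
    return major_version(chrome_output) == major_version(driver_output)
```

For example, the output of `google-chrome --version` and `chromedriver --version` can be passed in as strings: `versions_match("Google Chrome 114.0.5735.90", "ChromeDriver 114.0.5735.90")` compares 114 with 114.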
Usage in practice
Import the required libraries
from selenium import webdriver
from time import sleep
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
Chrome startup settings
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')  # Works around the "DevToolsActivePort file doesn't exist" error
chrome_options.add_argument('--window-size=1920,3000')  # Set the browser window size (width,height)
chrome_options.add_argument('--disable-gpu')  # Google's documentation mentions adding this flag to avoid bugs
chrome_options.add_argument('--hide-scrollbars')  # Hide scroll bars, which helps with some special pages
chrome_options.add_argument('--blink-settings=imagesEnabled=false')  # Do not load images, to speed up page loads
chrome_options.add_argument('--headless')  # Run without a visible window; on a Linux system with no display, Chrome fails to start without this flag
Also thanks to the blog above
Set additional preferences, such as suppressing the download pop-up and setting the default download path
prefs = {'profile.default_content_settings.popups': 0, 'download.default_directory': './filelist'}
chrome_options.add_experimental_option('prefs', prefs)
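A relative download path depends on the directory the process was started from, so it may be safer to pass Chrome an absolute path and make sure the directory exists first. The following is a minimal sketch under that assumption. The helper name `build_download_prefs` is hypothetical, and the preference keys are the same two used in this article.

```python
import os


def build_download_prefs(directory: str) -> dict:
    """Create the download directory and return Chrome prefs pointing at it."""
    absolute_dir = os.path.abspath(directory)
    # Create the directory up front so the path is valid before Chrome starts.
    os.makedirs(absolute_dir, exist_ok=True)
    return {
        "profile.default_content_settings.popups": 0,
        "download.default_directory": absolute_dir,
    }
```

The result can be passed in the same way as the hand-written dictionary: `chrome_options.add_experimental_option('prefs', build_download_prefs('./filelist'))`.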
Initialize the driver (cls here is the enclosing class, as in a classmethod that shares one driver)
cls.driver = webdriver.Chrome(options=chrome_options)
Exit the driver
cls.driver.quit()
Request a URL
cls.driver.get(url)
Execute the specified JavaScript code
cls.driver.execute_script('console.log("helloworld")')
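When the executed script starts a download, the call returns before the file is fully written, so the crawler needs some way to wait. One possible approach, sketched below, polls the download directory until it contains files and none of them is a partial download. It relies on Chrome giving in-progress downloads a `.crdownload` suffix, and it assumes the directory was empty before the download started. The function name `wait_for_downloads` and its parameters are hypothetical.

```python
import os
import time


def wait_for_downloads(directory: str, timeout: float = 60.0, poll: float = 0.5) -> list:
    """Block until `directory` holds files and no `.crdownload` partials remain."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = os.listdir(directory) if os.path.isdir(directory) else []
        partial = [name for name in names if name.endswith(".crdownload")]
        if names and not partial:
            return sorted(names)  # all downloads look complete
        time.sleep(poll)
    raise TimeoutError(f"downloads in {directory!r} did not finish within {timeout}s")
```

A typical call would follow the script execution, for example `wait_for_downloads('./filelist')`, before calling `cls.driver.quit()`.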
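Find the specified element (find_element_by_class_name is the Selenium 3 API; it was removed in Selenium 4.3, where the equivalent call is find_element(By.CLASS_NAME, "fubiaoti") with By imported from selenium.webdriver.common.by)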
subtitle = cls.driver.find_element_by_class_name("fubiaoti").text