Implementing a headless Python crawler to download files

Some pages generate their content by running JavaScript, so the content cannot be fetched with requests alone. This article covers those pages, for example ones where a file can only be downloaded by triggering a JavaScript call.

Install Chrome

wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
yum install ./google-chrome-stable_current_x86_64.rpm
yum install mesa-libOSMesa-devel gnu-free-sans-fonts wqy-zenhei-fonts

Install ChromeDriver

Taobao mirror (recommended)

wget http://npm.taobao.org/mirrors/chromedriver/2.41/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
mv chromedriver /usr/bin/
chmod +x /usr/bin/chromedriver

Thanks to the blog post these steps are based on.

In the steps above, you can download whichever versions suit you. Note that the Chrome and ChromeDriver versions must match; each ChromeDriver release states which Chrome versions it supports.
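The version check can also be scripted. The following is only a minimal sketch: it assumes a ChromeDriver release from 73 onward, where the driver's major version mirrors Chrome's. For older 2.x drivers such as the 2.41 used above, the numbers do not line up and the release notes remain the reference. The function names and the default binary names `google-chrome` and `chromedriver` are assumptions for illustration.

```python
import re
import subprocess


def major_version(version_output):
    """Return the major version number found in a `--version` output string."""
    match = re.search(r"(\d+)\.\d+", version_output)
    if match is None:
        raise ValueError("no version number in: %r" % version_output)
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """True when both outputs report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


def installed_versions_match(chrome_bin="google-chrome", driver_bin="chromedriver"):
    """Run both binaries with --version and compare (binary names are assumptions)."""
    chrome = subprocess.check_output([chrome_bin, "--version"]).decode()
    driver = subprocess.check_output([driver_bin, "--version"]).decode()
    return versions_match(chrome, driver)
```

Calling `installed_versions_match()` before starting the crawler gives an early, readable failure instead of a session error from the driver.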

Usage

Import the required libraries

from selenium import webdriver
from time import sleep
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

Chrome startup settings

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')  # Avoids the "DevToolsActivePort file doesn't exist" error
chrome_options.add_argument('--window-size=1920,3000')  # Set the browser window size (width,height)
chrome_options.add_argument('--disable-gpu')  # Google's documentation mentions adding this flag to avoid bugs
chrome_options.add_argument('--hide-scrollbars')  # Hide scroll bars, which helps on some special pages
chrome_options.add_argument('--blink-settings=imagesEnabled=false')  # Do not load images, to speed things up
chrome_options.add_argument('--headless')  # Run without a visible window; on a Linux system with no display, Chrome fails to start without this

Thanks again to the blog mentioned above.

Set additional preferences, such as suppressing the download pop-up and choosing the default download directory

import os
prefs = {'profile.default_content_settings.popups': 0,
         'download.default_directory': os.path.abspath('./filelist')}  # Chrome generally expects an absolute path here
chrome_options.add_experimental_option('prefs', prefs)
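Some headless Chrome builds are reported to block downloads even with these preferences set. A common workaround is to send the DevTools command `Page.setDownloadBehavior` once the driver exists. The sketch below assumes a Selenium release whose Chrome driver provides `execute_cdp_cmd`; the helper name `allow_headless_downloads` is hypothetical.

```python
import os


def allow_headless_downloads(driver, download_dir):
    """Ask Chrome, through the DevTools protocol, to save downloads in download_dir."""
    path = os.path.abspath(download_dir)  # DevTools is given an absolute path
    os.makedirs(path, exist_ok=True)      # create the directory if it is missing
    driver.execute_cdp_cmd(
        "Page.setDownloadBehavior",
        {"behavior": "allow", "downloadPath": path},
    )
    return path
```

It would be called right after the driver is created, for example `allow_headless_downloads(driver, './filelist')`. Whether it is needed depends on the Chrome version in use.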

Initialize the driver

cls.driver = webdriver.Chrome(options=chrome_options)

Exit the driver

cls.driver.quit()
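If the script raises an error between creating and quitting the driver, the Chrome process can be left running. One way to guard against that is a small context manager, sketched here. The name `managed_driver` and the idea of passing in a factory callable are assumptions for illustration, not part of Selenium.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always quit it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and on exceptions, so Chrome is not left behind.
        driver.quit()
```

With the options built earlier, usage would look like `with managed_driver(lambda: webdriver.Chrome(options=chrome_options)) as driver: driver.get(url)`.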

Request a URL

cls.driver.get(url)

Execute JavaScript code

cls.driver.execute_script('console.log("helloworld")')
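When the JavaScript call starts a download, the script has to wait for the file to finish before quitting the driver. Chrome writes unfinished downloads with a `.crdownload` suffix, so one simple approach is to poll the download directory. This is a sketch under the assumption that the directory is empty before the download starts; the function name and its default timeout are illustrative.

```python
import os
import time


def wait_for_downloads(download_dir, timeout=30.0, poll=0.5):
    """Block until download_dir holds finished files and no *.crdownload partials.

    Returns the sorted list of finished file names, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(download_dir) if os.path.isdir(download_dir) else []
        partial = [n for n in names if n.endswith(".crdownload")]
        finished = sorted(n for n in names if not n.endswith(".crdownload"))
        if finished and not partial:
            return finished
        if time.monotonic() >= deadline:
            raise TimeoutError("download did not finish in %s" % download_dir)
        time.sleep(poll)
```

Polling the directory keeps the check independent of the page itself, at the cost of a fixed polling delay.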

Find the specified element

from selenium.webdriver.common.by import By
subtitle = cls.driver.find_element(By.CLASS_NAME, "fubiaoti").text
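A single-element lookup raises `NoSuchElementException` when nothing matches, which can stop a crawl on pages that lack the element. One alternative, sketched below, uses the plural `find_elements`, which returns an empty list instead. The helper name `text_or_default` is hypothetical, and the literal `"class name"` is used because it is the locator string behind Selenium's `By.CLASS_NAME` constant.

```python
def text_or_default(driver, class_name, default=None):
    """Return the text of the first element with class_name, or default if absent."""
    # "class name" is the locator string that By.CLASS_NAME stands for.
    elements = driver.find_elements("class name", class_name)
    return elements[0].text if elements else default
```

For the lookup above this would read `subtitle = text_or_default(cls.driver, "fubiaoti", default="")`.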
