Detailed explanation of Python web page parser usage examples

Python web page parsers

  1. Common Python web page parsing tools include: regular-expression matching with re, Python's built-in html.parser module, the third-party library BeautifulSoup (the one to focus on learning), and the lxml library.

  1. Classification of common web page parsers

(1) Fuzzy matching: re (regular expressions) matches patterns in the page as plain text, without regard to tag structure;

(2) Structural analysis: BeautifulSoup, html.parser and lxml all extract information by following the tag structure of the DOM tree.
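
To make the two categories concrete, here is a minimal sketch that pulls the same links out of a small page, once with re and once with html.parser. The sample markup and the LinkCollector class are hypothetical and exist only for illustration. Note that html.parser reports tags as events while it reads the markup rather than building a tree itself, but it still works from the tag structure instead of raw text patterns.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup, used only for this illustration.
HTML = '<ul><li><a href="/a">First</a></li><li><a class="x" href="/b">Second</a></li></ul>'

# (1) Fuzzy matching: a regular expression scans the raw text for a pattern.
regex_links = re.findall(r'href="([^"]+)"', HTML)


# (2) Structural analysis: html.parser calls a handler for every tag it meets.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(HTML)
parsed_links = collector.links

print(regex_links)
print(parsed_links)
```

Both approaches find the same links here. The regular expression would also match href text that appears outside an a tag (for example inside a comment), which is why structural parsers are usually the safer choice for real pages.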

  1. DOM tree explanation: DOM stands for Document Object Model, which represents the page as a tree of nested tags; please see the figure below.

Structured analysis means that the web page parser treats the entire downloaded HTML document as a Document object, then uses the parent-child nesting of its tags to traverse up and down the tree and extract information.
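
The idea of a Document object that can be walked up and down can be sketched with the standard library alone. The example below uses xml.dom.minidom purely as an illustration of the DOM; it is an XML parser, so it assumes well-formed markup with every tag closed, and it is not one of the parsers discussed in this article. The sample page is hypothetical.

```python
from xml.dom.minidom import parseString

# Hypothetical, well-formed sample page (every tag is closed).
PAGE = "<html><head><title>Demo</title></head><body><div><p>Hello</p><p>World</p></div></body></html>"

doc = parseString(PAGE)       # the whole page becomes one Document object
root = doc.documentElement    # the <html> element at the top of the tree

# Walk downward: tag names of the direct children of <html>.
child_tags = [node.tagName for node in root.childNodes]

# Search the whole tree for <p> elements and read their text.
paragraphs = [p.firstChild.data for p in doc.getElementsByTagName("p")]

# Walk upward and sideways from the first <p>.
first_p = doc.getElementsByTagName("p")[0]
parent_tag = first_p.parentNode.tagName
sibling_text = first_p.nextSibling.firstChild.data

print(child_tags, paragraphs, parent_tag, sibling_text)
```

BeautifulSoup and lxml expose the same kind of parent, child and sibling navigation, and they also tolerate the unclosed or badly nested tags found on real web pages.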

# Import the related packages, urllib and bs4, which are commonly used libraries for fetching and parsing web pages
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Open the link
html = urlopen("https://www.datalearner.com/website_navi")

# Pass the response returned by urlopen to BeautifulSoup; bsObj then holds the parsed HTML document of the target page
# Naming the parser explicitly ("html.parser" ships with Python) avoids bs4's "no parser specified" warning
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj)

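# A local HTML file can be parsed the same way; here url is assumed to hold the path to that file
soup = BeautifulSoup(open(url, 'r', encoding='utf-8'), 'html.parser')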

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36', 'referer': "www.mmjpg.com"}
all_url = 'http://www.mmjpg.com/'
# 'User-Agent': identifies the client (here, a browser) that is making the request
# 'referer': the page the request is said to come from

start_html = requests.get(all_url, headers=headers)
# all_url: the starting address, which is the first page visited
# headers: request headers, which tell the server who is making the request
# requests.get: sends a GET request to all_url and returns a response object holding the page content

soup = BeautifulSoup(start_html.text, 'lxml')
# BeautifulSoup: parses the page
# 'lxml': the parser to use (requires the lxml package to be installed)
# start_html.text: the content of the page as text
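
Once a page has been parsed, information is extracted by searching the tag tree. The following is a minimal sketch under stated assumptions: the sample markup, the class name pic and the link paths are hypothetical stand-ins for start_html.text so that the example runs offline, and the built-in "html.parser" is used in place of 'lxml' so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for start_html.text, so the sketch runs offline.
sample_html = """
<html><head><title>Sample page</title></head>
<body>
  <div class="pic"><a href="/mm/1">Album one</a></div>
  <div class="pic"><a href="/mm/2">Album two</a></div>
</body></html>
"""

# "html.parser" ships with Python; "lxml" also works if that package is installed.
soup = BeautifulSoup(sample_html, "html.parser")

# Text of the <title> tag
title = soup.title.string

# (link text, target) pairs for every <a> tag in the page
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

# First <div class="pic"> in the page
first_pic = soup.find("div", class_="pic")

print(title)
print(links)
print(first_pic.a["href"])
```

find_all returns every matching tag, while find returns only the first match (or None when nothing matches), so real code should check the result of find before using it.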

That is all for this article; I hope it will be helpful to everyone's study.
