How to use Python to crawl news articles

In this article, we will discuss how to use Python to crawl news articles. This can be done with the convenient newspaper package.

Introduction to the Python newspaper package##

You can use pip to install the newspaper package. On Python 3 it is published on PyPI as newspaper3k, while the import name stays newspaper:

pip install newspaper3k

After the installation is complete, you can get started. The newspaper package can work either by scraping a single article from a given URL or by finding links to other news articles on a web page. Let's start with a single article. First, we import the Article class. Next, we use this class to download the content at the URL of our news article. Then, we use the parse method to parse the HTML. Finally, we can print the text of the article with .text.

Scrape a single article###

from newspaper import Article

url = "https://www.bloomberg.com/news/articles/2020-08-01/apple-buys-startup-to-turn-iphones-into-payment-terminals?srnd=premium"

# download and parse the article
article = Article(url)
article.download()
article.parse()

# print the article text
print(article.text)

You can also get other information about the article, such as links to images or videos embedded in the post.

# get list of image links
article.images
 
# get list of videos - empty in this case
article.movies
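
A parsed article carries a few more fields besides the text and media links. The sketch below gathers some of them into a dictionary. The helper names article_to_dict and scrape_to_dict are hypothetical, and it assumes the title, authors, publish_date and top_image attributes have been filled in by parse(); any of them may be empty when the page does not provide the metadata.

```python
def article_to_dict(article):
    """Collect commonly used fields from an already parsed article."""
    return {
        "title": article.title,
        "authors": list(article.authors),
        "publish_date": article.publish_date,  # may be None if not detected
        "top_image": article.top_image,
        "text": article.text,
    }


def scrape_to_dict(url):
    """Usage sketch: needs network access and the newspaper package."""
    from newspaper import Article  # imported here so the helper above stays standalone

    article = Article(url)
    article.download()
    article.parse()
    return article_to_dict(article)
```

Calling scrape_to_dict(url) with the URL from the example above should give one dictionary per article, which is easier to store or serialize than the Article object itself.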

Download all articles linked on the webpage###

Now, let's see how to get all the news articles linked from a web page. We will use the newspaper.build function to do this. Then, we can use the article_urls method to extract the article URLs.

import newspaper
 
site = newspaper.build("https://news.ycombinator.com/")  
 
# get list of article URLs
site.article_urls()

Using the same object, we can also get the content of each article. All of the article objects are stored in the list site.articles. For example, let's get the content of the first article.

site_article = site.articles[0]
 
site_article.download()
site_article.parse()

print(site_article.text)

Now, let's modify the code to get the top ten articles:

top_articles = []
for index in range(10):
    article = site.articles[index]
    article.download()
    article.parse()
    top_articles.append(article)
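
The loop above assumes that the site has at least ten articles and that every download succeeds. A more defensive variant might look like the sketch below. The function name download_first is hypothetical, and the broad except clause is a simplification for illustration; in practice a failed download typically surfaces as an exception when the article is parsed.

```python
def download_first(articles, limit=10):
    """Download and parse up to `limit` articles, skipping any that fail."""
    parsed = []
    for article in articles[:limit]:  # slicing is safe when fewer articles exist
        try:
            article.download()
            article.parse()
        except Exception as exc:  # deliberately broad for this sketch
            print(f"Skipping {getattr(article, 'url', '?')}: {exc}")
            continue
        parsed.append(article)
    return parsed
```

With the site object built earlier, this could be called as top_articles = download_first(site.articles).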

Caveat!

An important note when using newspaper: if you run newspaper.build with the same URL multiple times, the package caches the articles it has already scraped and leaves them out of later results. For example, in the code below, we run newspaper.build twice and get different results. The second run returns only the newly added links.

site = newspaper.build("https://news.ycombinator.com/")
print(len(site.articles))

site = newspaper.build("https://news.ycombinator.com/")
print(len(site.articles))

This caching can be turned off by passing an extra parameter to the function call, as shown below:

site = newspaper.build("https://news.ycombinator.com/", memoize_articles=False)

How to get the article summary###

The newspaper package also supports some NLP functions. You can run them by calling the nlp method on an article that has already been downloaded and parsed.

article = top_articles[3]
 
article.nlp()

Now, let's use the summary attribute. This will try to return a summary of the article.

article.summary

You can also get a list of keywords from the article.

article.keywords
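
The NLP steps can be wrapped into one helper so that they always run in the right order. This is a minimal sketch; the function name summarize_article is hypothetical. It also assumes that the NLTK tokenizer data used by nlp() is already installed, which may require a one-time download such as nltk.download('punkt').

```python
def summarize_article(article):
    """Run download, parse and nlp in order; return (summary, keywords)."""
    article.download()
    article.parse()
    article.nlp()  # must come after download() and parse()
    # summary and keywords are attributes filled in by nlp(), not methods
    return article.summary, list(article.keywords)
```

For example, summary, keywords = summarize_article(Article(url)) would process a fresh article in one call.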

How to get the most popular Google keywords###

The newspaper package has some other cool features. For example, we can use the hot function to get the most popular searches on Google.

newspaper.hot()

The package can also return a list of popular URLs, as shown below.

newspaper.popular_urls()
