Python crawler: Gerapy crawler management

13. Gerapy

Learning targets
  1. Understand what Gerapy is
  2. Master the installation of Gerapy
  3. Master Gerapy configuration and startup
  4. Master managing scrapy projects through Gerapy

1. Gerapy introduction

Gerapy is a distributed crawler management framework. It supports Python 3 and is built on Scrapy, Scrapyd, Scrapyd-Client, Scrapy-Redis, Scrapyd-API, Scrapy-Splash, Jinja2, Django, and Vue.js. Gerapy can help us:

  1. Control crawler runs more conveniently
  2. View crawler status more intuitively
  3. View crawl results closer to real time
  4. Deploy projects more simply
  5. Manage hosts in a more unified way

2. Gerapy installation

1. Execute the following command and wait for the installation to complete:

pip3 install gerapy

2. Verify that gerapy installed successfully

Run gerapy in a terminal; output like the following should appear:

Usage:
  gerapy init [--folder=<folder>]
  gerapy migrate
  gerapy createsuperuser
  gerapy runserver [<host:port>]

3. Gerapy configuration and startup

1. Create a new project:

gerapy init

After executing this command, a gerapy folder is generated in the current directory; enter it and you will find a folder named projects.

2. Initialize the database by running the following command inside the gerapy directory:

gerapy migrate

After initialization, a SQLite database is generated; it stores host configuration information, deployment versions, and so on.

3. Start the gerapy service:

gerapy runserver

This starts the Gerapy service on port 8000 of the machine running it. Enter http://localhost:8000 in a browser to open the Gerapy management interface, where hosts and projects can be managed. By default the service listens locally; to make it reachable from other machines, pass a host and port explicitly, e.g. gerapy runserver 0.0.0.0:8000.

4. Manage scrapy projects through Gerapy

1. Configure the host: add a Scrapyd host by entering its IP, port, and a name, then click Create to complete the addition. Click Back to see the list of currently added Scrapyd services; after creation succeeds, the new service appears in the list.
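Before adding a host, it can be useful to confirm that the Scrapyd service is actually reachable; Scrapyd exposes a daemonstatus.json endpoint for this. A minimal sketch, assuming Scrapyd's default port 6800 on the local machine (adjust host and port to your setup):

```python
# Sketch: check that a Scrapyd host is reachable before adding it in Gerapy.
# The address below assumes scrapyd's default port 6800 on the local machine.
import json
from urllib.request import urlopen
from urllib.error import URLError


def scrapyd_is_up(host="127.0.0.1", port=6800, timeout=3):
    """Return True if scrapyd's daemonstatus.json endpoint answers with status 'ok'."""
    try:
        with urlopen(f"http://{host}:{port}/daemonstatus.json", timeout=timeout) as resp:
            return json.load(resp).get("status") == "ok"
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, etc.
        return False


print(scrapyd_is_up())  # False unless a scrapyd instance is running locally
```

If this returns False, check that scrapyd is running on the target machine and that the port is not blocked before configuring the host in Gerapy.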

2. To run a crawler, click Schedule and then Run. (This assumes the crawler has already been deployed to the Scrapyd service configured above.)

  1. Configure projects: place the scrapy project directly under /gerapy/projects.

  2. The project then appears in the Gerapy backend.

  3. Click Deploy to package and deploy it. In the lower right corner, enter a description of the package (similar to a Git commit message), then click the Package button; Gerapy will report that packaging succeeded, and the packaged result and package name are shown on the left.

  4. Select a host and click Deploy on the right to deploy the project to it.

  5. After successful deployment, the description and deployment time are displayed.

  6. Go to the Clients page, find the node the project was deployed to, and click Schedule.

  7. In the node's project list, find the project and click Run on the right to run it.

Supplement:

1. How is Gerapy related to Scrapyd?

Scrapyd alone is enough to run scrapy crawls; a crawler can be started from the command line:

curl http://127.0.0.1:6800/schedule.json -d project=<project name> -d spider=<spider name>

Gerapy is a convenience layer over this command-line workflow: once Scrapyd is configured in Gerapy, the command line is no longer needed, and crawlers can be started directly from the graphical interface.
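The curl call above can also be issued from Python. A minimal sketch, assuming scrapyd's default address 127.0.0.1:6800; "myproject" and "myspider" are placeholder names, substitute your own:

```python
# Sketch: call scrapyd's schedule.json from Python, equivalent to the curl
# command above. "myproject" and "myspider" are placeholder names.
from urllib.parse import urlencode
from urllib.request import urlopen

SCRAPYD = "http://127.0.0.1:6800"  # scrapyd's default address and port


def schedule(project, spider):
    """POST project/spider to schedule.json and return scrapyd's raw reply."""
    data = urlencode({"project": project, "spider": spider}).encode()
    with urlopen(f"{SCRAPYD}/schedule.json", data=data) as resp:
        return resp.read()


# The form body below is exactly what curl sends with its two -d options:
body = urlencode({"project": "myproject", "spider": "myspider"})
print(body)  # project=myproject&spider=myspider
```

Calling schedule("myproject", "myspider") is then equivalent to the curl command, provided scrapyd is running and the project has been deployed to it.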

Summary
  1. Understand what Gerapy is
  2. Master the installation of Gerapy
  3. Master Gerapy configuration and startup
  4. Master managing scrapy projects through Gerapy
