Scrapyd: Step-by-Step Tutorial

Whether you’re scraping data at scale or need an easier way to manage multiple spiders, Scrapyd simplifies the process and helps keep everything running smoothly.

What Is Scrapyd?

Scrapyd is a tool for deploying and managing Scrapy spiders on a server. You can control everything remotely through simple API calls. The Scrapyd server runs as a background service, automatically handling crawl requests and executing them without manual intervention.

With Scrapyd, you can:

  • Deploy and manage your Scrapy project remotely with ease.
  • Control all your scraping jobs through a unified JSON API.
  • Monitor and manage your spiders using a user-friendly web interface.
  • Scale your data collection by running spiders across multiple servers.
  • Improve server performance by adjusting the number of concurrent spiders.
  • Automate tasks using tools like Celery and Gerapy.
  • Integrate Scrapy with Python frameworks like Django to enhance your web apps.

Now, let’s dive into how to deploy your Scrapy spiders using Scrapyd.

Alternative Solution — Web Scraping APIs and Tools

If your project requires scraping at scale and you don’t want to deal with proxies and CAPTCHA solvers, you can opt for a dedicated web scraping API or tool instead. Below, I’ve listed five popular options.

  1. Bright Data — Best overall for advanced scraping; features extensive proxy management and reliable APIs.
  2. Octoparse — User-friendly no-code tool for automated data extraction from websites.
  3. ScrapingBee — Developer-oriented API that handles proxies, browsers, and CAPTCHAs efficiently.
  4. Scrapy — Open-source Python framework ideal for data crawling and scraping tasks.
  5. ScraperAPI — Handles tough scrapes with advanced anti-bot technologies; great for developers.

I am not affiliated with any of the providers mentioned; I’ve simply had good experiences using them.

How to Run Scrapy Spiders with Scrapyd

Prerequisites

Ensure that Python 3+ is installed on your system. You will need to install Scrapy, Scrapyd, and Scrapyd-client via pip.

pip install scrapyd scrapy scrapyd-client
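
If you want to confirm that all three packages installed correctly, you can list them with pip:

pip show scrapy scrapyd scrapyd-client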

Setting Up Your Scrapy Project

Create a Scrapy Project: Use the command scrapy startproject <PROJECT_NAME> to create a Scrapy project. This tutorial uses scraper as the project name, which is the name the deployment configuration below refers to.
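
With that name, Scrapy’s default template generates a layout roughly like the following (the exact files vary slightly between Scrapy versions):

scraper/
    scrapy.cfg          # deployment configuration read by scrapyd-deploy
    scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py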

Create a Spider: In the spiders folder, create a scraper.py file with the following basic spider:

from scrapy.spiders import Spider


class MySpider(Spider):
    name = 'product_scraper'
    start_urls = ['https://www.scrapingcourse.com/ecommerce/']

    def parse(self, response):
        # Select every product card on the listing page
        products = response.css('ul.products li.product')
        data = []
        for product in products:
            product_name = product.css('h2.woocommerce-loop-product__title::text').get()
            price = product.css('bdi::text').get()
            data.append({'product_name': product_name, 'price': price})
        # Log the scraped data (yield the dictionaries instead if you want Scrapy to collect them as items)
        self.log(data)

Test Your Spider: Test the spider locally by running:

scrapy crawl product_scraper

This should scrape and log product names and prices from the eCommerce page.

Deploying Spiders to Scrapyd

Start the Scrapyd Server: Run the command below to start the Scrapyd server:

scrapyd

You’ll see the server running on http://localhost:6800.
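
You can confirm the daemon is responding by querying its daemonstatus.json endpoint, which reports the number of pending, running, and finished jobs:

curl http://localhost:6800/daemonstatus.json

The response should look something like {"status": "ok", "pending": 0, "running": 0, "finished": 0, "node_name": "your-hostname"}.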

Configure the Scrapy Project: Modify the scrapy.cfg file in your project to include the correct deployment URL:

[settings]
default = scraper.settings

[deploy:local]
url = http://localhost:6800/
project = scraper

Deploy the Spider: Deploy your spider to Scrapyd using the following command:

scrapyd-deploy local -p scraper

You should see a JSON response confirming the deployment.
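
The exact values depend on your machine and project, but the response is a JSON object along these lines:

{"status": "ok", "project": "scraper", "version": "1700000000", "spiders": 1, "node_name": "your-hostname"}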

Monitor the Deployment: Open your browser and navigate to http://localhost:6800. Your project should be listed under “Available projects.”

Managing Spiders with Scrapyd

Scheduling Tasks

You can schedule spiders using Scrapyd’s JSON API. The scheduling endpoint is http://localhost:6800/schedule.json. Use the following curl command:

curl http://localhost:6800/schedule.json -d project=scraper -d spider=product_scraper

Alternatively, you can create a Python script (schedule.py) to make the request:

import requests

# Scrapyd's scheduling endpoint
url = 'http://localhost:6800/schedule.json'
data = {'project': 'scraper', 'spider': 'product_scraper'}

response = requests.post(url, data=data)
if response.status_code == 200:
    # A successful request returns a job ID for the scheduled run
    print(response.json())
else:
    print('Scheduling failed:', response.status_code, response.text)
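
schedule.json also lets you override Scrapy settings with the setting parameter and forwards any other parameter to the spider as an argument. Here is a sketch of that, assuming a hypothetical category argument that your spider knows how to read:

import requests

url = 'http://localhost:6800/schedule.json'
data = {
    'project': 'scraper',
    'spider': 'product_scraper',
    # Override a Scrapy setting for this run only
    'setting': 'DOWNLOAD_DELAY=2',
    # Any unrecognized parameter is passed to the spider as an argument
    # ('category' is a hypothetical example)
    'category': 'clothing',
}

response = requests.post(url, data=data)
print(response.json())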

Monitoring Jobs

To monitor all running tasks, use the listjobs.json endpoint:

curl "http://localhost:6800/listjobs.json?project=scraper"

You can also create a Python script (monitor.py) for monitoring:

import requests

# List pending, running, and finished jobs for the project
url = 'http://localhost:6800/listjobs.json'
params = {'project': 'scraper'}

response = requests.get(url, params=params)
if response.status_code == 200:
    print(response.json())
else:
    print('Request failed:', response.status_code, response.text)
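
The response groups jobs into pending, running, and finished lists, and each entry includes an id you can later pass to cancel.json. For example, a small sketch that prints the currently running jobs:

import requests

response = requests.get('http://localhost:6800/listjobs.json', params={'project': 'scraper'})
jobs = response.json()

# Each running entry includes the job id, spider name, and start time
for job in jobs.get('running', []):
    print(job['id'], job['spider'])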

Cancelling Jobs

To cancel a running job, use the cancel.json endpoint. Provide the job ID you want to cancel:

curl http://localhost:6800/cancel.json -d project=scraper -d job=<TARGET_JOB_ID>

Or, in Python:

import requests

# Cancel a specific job by its ID
url = 'http://localhost:6800/cancel.json'
data = {'project': 'scraper', 'job': '<TARGET_JOB_ID>'}

response = requests.post(url, data=data)
if response.status_code == 200:
    print(response.json())
else:
    print('Cancellation failed:', response.status_code, response.text)
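
A successful cancellation returns the job’s previous state, for example:

{"status": "ok", "prevstate": "running", "node_name": "your-hostname"}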

ScrapydWeb: User Interface for Managing Scrapy Spiders

ScrapydWeb is a web-based interface for managing Scrapyd tasks. It allows you to easily schedule and monitor spiders; note that, at the time of writing, it only supports Python versions below 3.9.

Install ScrapydWeb: Install it using pip:

pip install scrapydweb

Start the ScrapydWeb Server: Run the command scrapydweb from your project folder. The interface will be accessible at http://127.0.0.1:5000.

Schedule and Monitor Spiders: Use the interface to schedule, run, and monitor your spiders. You can also set cron jobs and configure spider parameters like user agents and cookies.

Gerapy: Advanced Spider Management

Gerapy is another tool built on Django and Scrapy for managing spiders. It offers additional features like scheduling cron jobs, a visual code editor, and more.

Install Gerapy: Install Gerapy using pip:

pip install gerapy

Set Up Gerapy: Initialize a Gerapy workspace, run its database migrations, create an admin user, and start the Gerapy server; then register your Scrapyd instance as a client in the web interface so Gerapy can sync with it.
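
As a rough guide (the exact commands may differ between Gerapy versions), the initial setup usually looks like this:

gerapy init          # create a gerapy workspace folder
cd gerapy
gerapy migrate       # initialize the local database
gerapy initadmin     # create a default admin account
gerapy runserver     # serve the UI, typically at http://127.0.0.1:8000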

Create and Schedule Tasks: Use Gerapy’s web interface to create tasks, schedule them using interval or cron triggers, and monitor task performance.

Conclusion

Scrapyd is a robust solution for managing Scrapy spiders, enabling efficient task scheduling, monitoring, and scaling. Using Scrapyd’s API, ScrapydWeb, or Gerapy, you can streamline your web scraping workflows and improve productivity.
