Infinite Scroll Scraping with Scrapy and Splash
Here, I’ll walk you through how to set up Scrapy with Splash, tackle infinite scrolling, pull in dynamic content, and deal with common scraping challenges. Let’s dive in and see how it’s done, step by step.
Why Scrape Infinite Scrolling Content?
Infinite scrolling is often seen on e-commerce sites, social media platforms, and news aggregators, where additional items load as the user scrolls down. A basic HTML parser won’t suffice to scrape this kind of site since new content only appears after a scrolling action. Headless browsers like Splash come into play, helping simulate scrolling and load dynamic content for effective scraping.
The Basics of Scrapy and Splash
Scrapy is an open-source web scraping framework in Python, known for its speed, simplicity, and extensibility. It provides a structured way to organize code and extract information from websites.
Splash is a headless browser specifically built for web scraping. It can execute JavaScript and render HTML pages. When integrated with Scrapy as Scrapy-Splash, it allows us to scrape websites that rely on JavaScript for loading content, such as those with infinite scroll.
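To make the integration concrete, here is a minimal sketch of a scrapy-splash spider (example.com is a placeholder): it fetches a page through Splash's default render.html endpoint so the page's JavaScript runs before parsing.

import scrapy
from scrapy_splash import SplashRequest


class RenderedPageSpider(scrapy.Spider):
    # Minimal scrapy-splash usage: let Splash render the page's JavaScript,
    # then parse the resulting HTML with Scrapy's usual selectors.
    name = 'rendered_page'

    def start_requests(self):
        # 'wait' gives the page's scripts time to run before rendering.
        yield SplashRequest('http://example.com', self.parse, args={'wait': 2})

    def parse(self, response):
        yield {'title': response.css('title::text').get()}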
Skip Infinite Scrolling Scraping — Get The Data
If your project is too complicated and you'd rather not spend time building a scraper, the dataset providers below are worth a look:
- Bright Data — Customizable and pre-built datasets across industries.
- Statista — Extensive statistics and reports for business and research.
- Datarade — Marketplace for premium data products from various providers.
- AWS Data Exchange — Third-party datasets integrated with AWS services.
- Zyte — Web scraping and custom datasets tailored to business needs.
- Data & Sons — Open marketplace for buying and selling diverse datasets.
- Coresignal — Workforce analytics with extensive job-related data.
- Oxylabs — Specialized company data and web scraping services.
- Bloomberg Enterprise Data Catalog — Financial data for enterprise use.
- Kaggle — Free public datasets and tools for data science.
Step 1: Setting Up Scrapy with Splash
To start using Splash with Scrapy, follow these initial setup steps.
1. Install Scrapy-Splash Open your terminal and install the scrapy-splash package:
pip install scrapy-splash
2. Run Splash in Docker Splash ships as a Docker image, so make sure Docker is installed and running on your machine. Pull the Splash image with:
docker pull scrapinghub/splash
Then start the Splash server:
docker run -it --rm -p 8050:8050 scrapinghub/splash
Now, Splash will be available at http://localhost:8050, ready to render JavaScript for your Scrapy spider.
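3. Configure Your Scrapy Project Enable scrapy-splash in your project's settings.py. The values below follow the scrapy-splash documentation:

# settings.py
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'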
Step 2: Writing a Lua Script for Scrolling
Splash’s Lua scripting feature allows you to manipulate the browser, scroll down, and wait for new content to load. The following Lua script scrolls to the bottom of the page, waits for content, and repeats this process several times.
Lua Script for Scrolling
function main(splash, args)
    splash:go(args.url)
    splash:wait(args.wait)
    local scroll_to = splash:jsfunc('window.scrollTo')
    local get_body_height = splash:jsfunc([[
        function() {
            return document.body.scrollHeight;
        }
    ]])
    local scroll_count = 0
    for _ = 1, args.max_scrolls do
        scroll_count = scroll_count + 1
        scroll_to(0, get_body_height())
        splash:wait(args.scroll_delay)
    end
    return {
        html = splash:html(),
        scroll_count = scroll_count
    }
end
In this script:
- splash:go(args.url) loads the target URL.
- splash:wait(args.wait) pauses to allow initial page elements to load.
- The for loop scrolls the page multiple times, waiting briefly (args.scroll_delay) after each scroll to allow new content to load.
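You can test the script before involving Scrapy by POSTing it to Splash's /execute endpoint. Here is a quick sketch using the requests library, assuming the script is saved to a local file named scroll.lua (the file name and target URL are placeholders):

import requests

# Load the scrolling script shown above from a local file.
with open('scroll.lua') as f:
    lua_script = f.read()

resp = requests.post(
    'http://localhost:8050/execute',
    json={
        'lua_source': lua_script,
        'url': 'http://example.com/target_page',
        'wait': 2,
        'scroll_delay': 1,
        'max_scrolls': 8,
    },
    timeout=90,
)
data = resp.json()
print('scrolls performed:', data['scroll_count'])
print('rendered HTML length:', len(data['html']))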
Step 3: Integrating Lua Script in a Scrapy Spider
With the Lua script ready, the next step is to set up a Scrapy spider to execute it. This spider sends a Splash request to the target site and passes in the Lua script.
Spider Code
import scrapy
from scrapy_splash import SplashRequest


class InfiniteScrollSpider(scrapy.Spider):
    name = 'infinite_scroll_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/target_page']

    lua_script = """
    function main(splash, args)
        splash:go(args.url)
        splash:wait(args.wait)
        local scroll_to = splash:jsfunc('window.scrollTo')
        local get_body_height = splash:jsfunc([[
            function() {
                return document.body.scrollHeight;
            }
        ]])
        local scroll_count = 0
        for _ = 1, args.max_scrolls do
            scroll_count = scroll_count + 1
            scroll_to(0, get_body_height())
            splash:wait(args.scroll_delay)
        end
        return {
            html = splash:html(),
            scroll_count = scroll_count
        }
    end
    """

    def start_requests(self):
        yield SplashRequest(
            self.start_urls[0],
            self.parse,
            endpoint='execute',
            args={
                'lua_source': self.lua_script,
                'wait': 2,
                'scroll_delay': 1,
                'max_scrolls': 8
            }
        )

    def parse(self, response):
        for item in response.css('.item-selector'):
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get()
            }
Explanation:
- The spider sends a SplashRequest with lua_source carrying the Lua script.
- Arguments like wait, scroll_delay, and max_scrolls control the script's scroll behavior.
- The parse function extracts item data (name and price) from the fully rendered page that Splash returns once all scrolls have completed.
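One detail worth knowing: because the Lua script returns a table, Splash replies with JSON, and scrapy-splash (with its default magic_response=True) uses the html key as the response body. That is why response.css works directly in parse. The other returned fields are available through response.data, so you could log the scroll count like this:

def parse(self, response):
    # response.data holds the full JSON table returned by the Lua script.
    self.logger.info('Performed %s scrolls', response.data['scroll_count'])
    for item in response.css('.item-selector'):
        yield {
            'name': item.css('.name::text').get(),
            'price': item.css('.price::text').get()
        }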
Step 4: Handling Pagination and “Load More” Buttons
Many infinite scroll pages use a hidden “Load More” button, activated when the user scrolls to the bottom. Splash’s Lua script can handle this by clicking on the “Load More” button when it appears.
Modified Lua Script for “Load More” Button
function main(splash, args)
    splash:go(args.url)
    splash:wait(args.wait)
    local scroll_to = splash:jsfunc('window.scrollTo')
    local get_body_height = splash:jsfunc([[
        function() {
            return document.body.scrollHeight;
        }
    ]])
    local scroll_count = 0
    for _ = 1, args.max_scrolls do
        scroll_count = scroll_count + 1
        scroll_to(0, get_body_height())
        splash:wait(args.scroll_delay)
        local load_more = splash:select('.load-more-button')
        if load_more then
            load_more:mouse_click()
            splash:wait(1)
        end
    end
    return splash:html()
end
Here, the load_more variable locates the “Load More” button using its selector. If found, it simulates a click, waits for content to load, and repeats scrolling.
Step 5: Bypassing Anti-Bot Protections
Infinite scroll pages often have anti-bot measures, including CAPTCHAs, rate limiting, and IP bans. Techniques to bypass these include:
- Proxy Rotation: Rotating IP addresses makes rate limits and IP bans much harder to trigger. Services like ZenRows and ScraperAPI offer IP rotation with minimal setup.
- User-Agent Rotation: Randomize your User-Agent string with each request so your traffic doesn't all appear to come from one client (see the middleware sketch after this list).
- Headless Browser: Splash executes JavaScript and renders pages like a real browser, making your requests look less like those of a bare HTTP client.
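Here is what User-Agent rotation might look like as a small Scrapy downloader middleware. This is a minimal sketch: the class name, module path, and the two sample User-Agent strings are illustrative, not taken from any particular library.

import random

# Illustrative pool; extend with strings for the browsers you want to imitate.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
]


class RotateUserAgentMiddleware:
    # Assign a random User-Agent header to every outgoing request.
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)


# Enable it in settings.py (the module path here is hypothetical):
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.RotateUserAgentMiddleware': 400,
# }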
Here’s how you could implement a rotating proxy using ZenRows:
import scrapy


class InfiniteScrollSpider(scrapy.Spider):
    name = 'proxy_spider'
    allowed_domains = ['example.com']

    def start_requests(self):
        proxy = 'http://<YOUR_ZENROWS_API_KEY>@api.zenrows.com:8001'
        url = 'http://example.com/target_page'
        yield scrapy.Request(
            url,
            callback=self.parse,
            meta={'proxy': proxy}
        )

    def parse(self, response):
        pass  # parsing logic goes here
This example configures Scrapy to use ZenRows as a proxy for each request.
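Note that the request above goes straight through Scrapy, skipping Splash. If you need JavaScript rendering and a proxy at the same time, Splash's HTTP API documents a proxy argument (in the form [protocol://][user:password@]host[:port]) that you can pass through SplashRequest's args; confirm support in your Splash version. A sketch, replacing start_requests in the earlier spider:

def start_requests(self):
    yield SplashRequest(
        self.start_urls[0],
        self.parse,
        endpoint='execute',
        args={
            'lua_source': self.lua_script,
            'wait': 2,
            'scroll_delay': 1,
            'max_scrolls': 8,
            # Routes Splash's own outgoing traffic through the proxy.
            'proxy': 'http://<YOUR_ZENROWS_API_KEY>@api.zenrows.com:8001',
        }
    )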
Step 6: Putting It All Together
Here’s the full code for an infinite scroll Scrapy spider, complete with Splash integration, a Lua script, and proxy rotation:
import scrapy
from scrapy_splash import SplashRequest


class FullInfiniteScrollSpider(scrapy.Spider):
    name = 'full_infinite_scroll'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/target_page']

    lua_script = """
    function main(splash, args)
        splash:go(args.url)
        splash:wait(args.wait)
        local scroll_to = splash:jsfunc('window.scrollTo')
        local get_body_height = splash:jsfunc([[
            function() {
                return document.body.scrollHeight;
            }
        ]])
        for _ = 1, args.max_scrolls do
            scroll_to(0, get_body_height())
            splash:wait(args.scroll_delay)
            local load_more = splash:select('.load-more-button')
            if load_more then
                load_more:mouse_click()
                splash:wait(1)
            end
        end
        return splash:html()
    end
    """

    def start_requests(self):
        yield SplashRequest(
            self.start_urls[0],
            self.parse,
            endpoint='execute',
            args={
                'lua_source': self.lua_script,
                'wait': 2,
                'scroll_delay': 1,
                'max_scrolls': 10
            }
        )

    def parse(self, response):
        for item in response.css('.item-selector'):
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get()
            }
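With everything in place, run the spider from a project configured with the scrapy-splash settings from Step 1 and export the scraped items to a file:

scrapy crawl full_infinite_scroll -o results.json

The -o flag appends items to results.json; Scrapy 2.1 and later also accept -O to overwrite the file on each run.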
Conclusion
When scraping infinite scroll pages, I find the combination of Scrapy and Splash hard to beat: it lets you tackle even the trickiest dynamic sites with little fuss. Splash handles the JavaScript rendering while Scrapy excels at the data extraction, and Splash's Lua scripting makes it possible to interact with elements on the page, loading additional content as if I were a user scrolling through.