Elixir Web Scraping in 2025: A Comprehensive Guide
In this guide, I’ll walk you through using Elixir for web scraping. We’ll go over the basics — setting up your environment, essential tools, and best practices. I’ll also cover techniques for scraping data from paginated and JavaScript-rendered pages. Let’s see why Elixir is becoming a go-to choice for web scraping!
Why Choose Elixir for Web Scraping?
Elixir’s strengths lie in its scalability and concurrency. Built on the Erlang VM (BEAM), Elixir can handle millions of lightweight processes, each with minimal overhead. This makes Elixir ideal for handling tasks that require multiple requests to be processed in parallel, such as scraping websites with paginated content. Elixir’s syntax and tooling ecosystem make it accessible, even for developers new to functional programming.
Key reasons to consider Elixir for web scraping include:
- Concurrency and Scalability: Elixir can handle many requests concurrently without sacrificing speed or memory efficiency, making it ideal for scraping multiple pages or sites (see the short sketch after this list).
- Reliability: the "let it crash" philosophy encourages isolating failures in small processes rather than writing defensive code, so one bad request or malformed page doesn't take down the whole scraper.
- Fault Tolerance: Elixir's built-in supervision trees restart crashed processes automatically, so the application recovers from unexpected issues without disrupting the entire scraping operation.
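To get a feel for that concurrency model outside of any scraping framework, here is a minimal sketch. It assumes HTTPoison is available (Crawly pulls it in as its default HTTP client) and uses placeholder URLs; Task.async_stream fans the requests out across lightweight BEAM processes:
# Fetch several pages in parallel with lightweight processes (placeholder URLs)
urls = [
  "https://www.scrapingcourse.com/ecommerce/",
  "https://www.scrapingcourse.com/ecommerce/page/2/"
]

urls
|> Task.async_stream(&HTTPoison.get/1, max_concurrency: 10, timeout: 15_000)
|> Enum.each(fn
  {:ok, {:ok, %HTTPoison.Response{status_code: status}}} -> IO.puts("fetched (#{status})")
  other -> IO.inspect(other, label: "request failed")
end)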
Ready-Made Data Alternatives to Scraping with Elixir
If you are not interested in scraping at all and only want the data, check out these dataset websites:
- Bright Data — Customizable and pre-built datasets across industries.
- Statista — Extensive statistics and reports for business and research.
- Datarade — Marketplace for premium data products from various providers.
- AWS Data Exchange — Third-party datasets integrated with AWS services.
- Zyte — Web scraping and custom datasets tailored to business needs.
- Data & Sons — Open marketplace for buying and selling diverse datasets.
- Coresignal — Workforce analytics with extensive job-related data.
- Oxylabs — Specialized company data and web scraping services.
- Bloomberg Enterprise Data Catalog — Financial data for enterprise use.
- Kaggle — Free public datasets and tools for data science.
Some are free, some are not, some provide free samples. Choose the one that fits your needs. I am not affiliated with any of them.
Setting Up Your Elixir Web Scraping Environment
To get started with web scraping in Elixir, you’ll need to set up an Elixir project and add two main libraries: Crawly and Floki.
- Crawly is a powerful crawling framework that mimics the structure and functionality of Scrapy, one of Python’s most popular web scraping libraries. It provides essential scraping components, including spiders, pipelines, and middleware.
- Floki is a straightforward HTML parser for Elixir. It allows you to use CSS selectors to target and retrieve specific elements from HTML documents.
Step 1: Setting Up an Elixir Project
First, install Elixir (and Erlang, if you’re on Windows), then create a new Elixir project:
mix new elixir_scraper --sup
This command initializes a new supervised project named elixir_scraper.
Step 2: Adding Dependencies
In your mix.exs file, add Crawly and Floki as dependencies:
defp deps do
  [
    {:crawly, "~> 0.16.0"},
    {:floki, "~> 0.33.0"}
  ]
end
Then, install the libraries:
mix deps.get
Your Elixir project is now ready for web scraping.
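Before writing a full spider, you can get a quick feel for Floki in an iex -S mix session. The HTML string below is just an inline example, not markup from a real site:
# Parse an inline HTML string and extract values with CSS selectors
html = ~s(<ul><li class="product"><a href="/p/1">Widget</a></li></ul>)

{:ok, document} = Floki.parse_document(html)

name = document |> Floki.find("li.product a") |> Floki.text()
url = document |> Floki.find("li.product a") |> Floki.attribute("href") |> List.first()

IO.puts("#{name} -> #{url}")
# => Widget -> /p/1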
Building a Simple Elixir Spider with Crawly
A Crawly spider is essentially an Elixir module that defines how to retrieve and parse data from a target website. In this example, we’ll create a spider to scrape data from ScrapingCourse.com, a mock e-commerce site.
Step 1: Creating a Spider
Generate a Crawly spider with the following command:
mix crawly.gen.spider --filepath ./lib/scrapingcourse_spider.ex --spidername ScrapingcourseSpider
This creates a file in the lib directory called scrapingcourse_spider.ex. Flesh out the generated skeleton so the spider looks like this:
defmodule ScrapingcourseSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.scrapingcourse.com/ecommerce/"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://www.scrapingcourse.com/ecommerce/"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Parse the raw HTML body into a Floki document
    {:ok, document} = Floki.parse_document(response.body)

    # Each product card is an <li class="product"> element
    product_items =
      document
      |> Floki.find("li.product")
      |> Enum.map(fn x ->
        %{
          url: Floki.find(x, "a.woocommerce-LoopProduct-link") |> Floki.attribute("href") |> Floki.text(),
          name: Floki.find(x, "h2.woocommerce-loop-product__title") |> Floki.text(),
          image: Floki.find(x, "img.attachment-woocommerce_thumbnail") |> Floki.attribute("src") |> Floki.text(),
          price: Floki.find(x, "span.price") |> Floki.text()
        }
      end)

    %Crawly.ParsedItem{items: product_items, requests: []}
  end
end
This code defines a ScrapingcourseSpider module that:
- Targets the ScrapingCourse.com website.
- Uses CSS selectors to extract data such as product name, price, image, and URL.
Step 2: Running the Spider
To execute the spider, run the following command:
iex -S mix run -e "Crawly.Engine.start_spider(ScrapingcourseSpider)"
Your scraper should log each extracted item to the console as it crawls, which is a quick way to verify your CSS selectors against the target site's structure.
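Because the command above drops you into an iex shell, you can also inspect and stop the crawl from there (these Crawly.Engine functions exist in recent Crawly releases; check the docs for your version):
# Inside the iex session started above
Crawly.Engine.running_spiders()
# => information about the currently running spiders

Crawly.Engine.stop_spider(ScrapingcourseSpider)
# stops the crawl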
Exporting Scraped Data to CSV
To store your scraped data in a CSV file, configure Crawly's pipelines in config/config.exs to include the CSVEncoder and WriteToFile pipelines:
import Config

config :crawly,
  middlewares: [],
  pipelines: [
    # encode each scraped item as a CSV row with these fields
    {Crawly.Pipelines.CSVEncoder, fields: [:url, :name, :image, :price]},
    # write the encoded rows to .csv files in the output folder
    {Crawly.Pipelines.WriteToFile, extension: "csv", folder: "output"}
  ]
Run the spider again, and Crawly will output a CSV file with the desired data fields in the output folder.
Handling Paginated Pages
Many websites distribute data across multiple pages. To handle pagination, inspect the target site’s navigation elements and configure your spider to follow additional pages.
For example, to scrape paginated product listings:
- Identify the CSS selector for the pagination links (e.g., a.page-numbers).
- Use Floki to extract each page URL, turn it into a Crawly request, and return those requests in the spider's %Crawly.ParsedItem{}.
def parse_item(response) do
  {:ok, document} = Floki.parse_document(response.body)

  product_items = [] # parse products as shown earlier…

  # Collect every pagination link and turn it into a Crawly request
  next_requests =
    document
    |> Floki.find("a.page-numbers")
    |> Floki.attribute("href")
    |> Enum.map(&Crawly.Utils.request_from_url/1)

  %Crawly.ParsedItem{items: product_items, requests: next_requests}
end
This setup enables Crawly to follow each pagination link, ensuring that all products are scraped.
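One caveat: every results page links to several other page numbers, so the same URLs get extracted again and again. Crawly ships a Crawly.Middlewares.UniqueRequest middleware that drops requests it has already scheduled; if you define your own middlewares list (as in the anti-blocking section below), make sure it stays in there. A minimal sketch:
config :crawly,
  middlewares: [
    # drops request URLs that were already scheduled, so pagination links
    # that appear on every page are only crawled once
    Crawly.Middlewares.UniqueRequest
    # ...plus any other middlewares you rely on (User-Agent rotation, etc.)
  ]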
Working with JavaScript-Rendered Content
Elixir, like many languages, cannot natively execute JavaScript within HTTP responses. However, you can integrate Splash (a headless browser) to handle JavaScript-rendered pages. Splash renders content server-side, providing Elixir with the complete HTML document, including JavaScript-generated elements.
Step 1: Setting Up Splash
Splash can be run as a Docker container. Pull and run the Splash image:
docker pull scrapinghub/splash
docker run -it -p 8050:8050 --rm scrapinghub/splash
Step 2: Configuring Crawly to Use Splash
In your config.exs file, configure Crawly to fetch pages through Splash:
import Config

config :crawly,
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html", wait: 3]}
This configuration ensures Crawly fetches data via Splash, allowing it to process JavaScript-rendered content. Now, when your spider makes requests, Splash will handle any JavaScript, sending Elixir the rendered HTML.
Avoiding Blocks with Proxies and User-Agents
Many websites implement anti-bot measures, which can block repeated requests from a single IP. You can mitigate these blocks by rotating User Agents and using proxy servers.
Rotating User-Agents: Add the Crawly.Middlewares.UserAgent middleware with a pool of User-Agent strings, and Crawly will pick one for each outgoing request:
config :crawly,
  middlewares: [
    {Crawly.Middlewares.UserAgent,
     user_agents: [
       "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
     ]}
  ]
Using Proxy Servers: Route requests through a proxy to hide your own IP and make bans less likely. Note that middlewares is a single list, so if you combine this with User-Agent rotation, put both middlewares in the same list rather than defining it twice:
config :crawly,
  middlewares: [
    {Crawly.Middlewares.RequestOptions, [proxy: {"http://proxyaddress.com", 8000}]}
  ]
Advanced Elixir Web Scraping Techniques
With Elixir, you can take scraping a step further, enhancing performance with additional configurations and advanced techniques.
Parallel Requests: Crawly already fetches pages concurrently; to speed up large crawls, raise concurrent_requests_per_domain in config.exs above its default of 4.
config :crawly,
  concurrent_requests_per_domain: 8
Error Handling: Crawly isolates scraping work in supervised processes and can retry failed requests, so a single bad page doesn't interrupt the rest of the crawl; a configuration sketch follows below.
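As one example, recent Crawly versions expose retry settings in the config; treat the keys below as a sketch and verify them against the documentation for the version you installed:
config :crawly,
  # retry failed requests a few times before giving up
  retry: [
    retry_codes: [500, 503],   # HTTP statuses treated as failures
    max_retries: 3,            # attempts per request before dropping it
    ignored_middlewares: [Crawly.Middlewares.UniqueRequest]  # so retries aren't deduplicated away
  ]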
Conclusion
Elixir is a powerful choice for web scraping, mainly because it can handle tasks concurrently and stay resilient under heavy loads. Libraries like Crawly and Floki make it easy to get started, offering powerful tools for everything from simple setups to more complex tasks like handling pagination, scraping JavaScript-rendered pages, and avoiding blocks.
Web scraping is evolving toward more distributed, durable setups, and Elixir’s functional style and robust processing power fit perfectly. If you’re exploring data extraction on a larger scale, Elixir is a reliable option that meets modern scraping needs.