Scala Web Scraping: Step-by-Step Tutorial 2025

In this tutorial, you’ll learn how to set up your Scala environment, extract data from static web pages, handle paginated content, and scrape dynamic websites with tools like Selenium. By the end, you’ll have a solid grasp of Scala web scraping techniques.

Why Choose Scala for Web Scraping?

Scala is a practical choice for web scraping due to:

  • Expressive syntax: Concise code that is easy to read and maintain.
  • Interoperability: Scala runs on the JVM and can leverage mature Java libraries such as Jsoup, Selenium, and HtmlUnit.
  • Functional capabilities: Constructs like map and filter simplify data manipulation tasks (see the short sketch below).

While Python and Node.js are more popular for web scraping, Scala offers a unique balance between performance and flexibility.
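
As a taste of that functional style, here is a minimal, self-contained sketch of the kind of data cleanup you'll do throughout this tutorial (the price strings are invented for illustration):

// Strip the currency symbol, parse to Double, and keep prices above 10
val prices = List("$12.99", "$7.50", "$24.00")
val filtered = prices.map(_.stripPrefix("$").toDouble).filter(_ > 10)
println(filtered) // List(12.99, 24.0)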

Web Scraping Without Code

If you want to scrape websites without any code, I suggest you take a look at these platforms:

  1. Bright Data — Enterprise-grade scraper with high scalability and automation.
  2. Octoparse — User-friendly, flexible data extraction with scheduling.
  3. ParseHub — Intuitive point-and-click scraper for structured data.
  4. Apify — Cloud-based scraping with pre-built templates and APIs.
  5. Web Scraper — Browser extension for quick and simple web scraping.

Interested in learning more? View the full no-code web scrapers article. I am not affiliated with any of the providers!

Prerequisites and Environment Setup

To start scraping in Scala, you need to prepare your environment.

Install Java and Scala

  1. Java: Scala requires a Java Development Kit (JDK). Download the latest LTS version of the JDK if you don’t already have it installed.
  2. Scala: Download and install Scala 3.x using the Coursier package manager.
  3. SBT: The Scala Build Tool (sbt) is essential for project management. It comes bundled with the Scala installer.
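
To confirm the toolchain is ready, you can check the installed versions from a terminal (the exact output depends on your versions):

java -version
scala -version
sbt --version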

Setting Up Your Scala Project

Create a new Scala project by following these steps:

  1. Navigate to the directory where you want to create your project.
  2. Run the following command to generate a new project template:
sbt new scala/scala3.g8

Name your project when prompted. For example, name it scala-web-scraper.

Enter the project folder:

cd scala-web-scraper

Import the project into your preferred IDE (e.g., IntelliJ IDEA or Visual Studio Code).

Creating Your First Web Scraper

We’ll use the popular Scala library scala-scraper to build our web scraper.

Step 1: Add Dependencies

Edit the build.sbt file and add the following dependency:

libraryDependencies += "net.ruippeixotog" %% "scala-scraper" % "3.1.1"

Run the update command to install the library:

sbt update

Step 2: Connect to a Web Page

In src/main/scala/Main.scala, create a Scala object that retrieves HTML from a webpage.

import net.ruippeixotog.scalascraper.browser.JsoupBrowser

object ScalaScraper {
  def main(args: Array[String]): Unit = {
    // Initialize the browser
    val browser = JsoupBrowser()
    // Connect to the webpage
    val doc = browser.get("https://www.scrapingcourse.com/ecommerce/")
    // Extract and print the HTML source
    val html = doc.toHtml
    println(html)
  }
}

Run the project with:

sbt run

This script connects to the page and prints its HTML content.

Extracting Data from HTML Elements

To scrape specific elements from the webpage, you can use CSS selectors. Add these imports to your Main.scala file:

import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL.Parse._

Now, scrape product details from the target page.

// Select the first product card on the page
val htmlProductElement = doc >> element("li.product")
// Extract each field of interest via a CSS selector
val name = htmlProductElement >> text("h2")
val url = htmlProductElement >> element("a") >> attr("href")
val image = htmlProductElement >> element("img") >> attr("src")
val price = htmlProductElement >> text("span")
println(s"Name: $name")
println(s"URL: $url")
println(s"Image: $image")
println(s"Price: $price")

Scraping Multiple Items

Let’s extract all product elements on the page.

val htmlProductElements = doc >> elementList("li.product")
val products = htmlProductElements.map { element =>
  val name = element >> text("h2")
  val url = element >> element("a") >> attr("href")
  val image = element >> element("img") >> attr("src")
  val price = element >> text("span")
  Product(name, url, image, price)
}
products.foreach(println)

Define a Product case class to store the data:

case class Product(name: String, url: String, image: String, price: String)

Exporting Data to CSV

To save the scraped data, use the scala-csv library.

Step 1: Add the Dependency

Add the following to build.sbt:

libraryDependencies += "com.github.tototoshi" %% "scala-csv" % "1.3.10"

Step 2: Write Data to CSV

Use the CSV writer to export the products:
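
A minimal sketch using scala-csv's CSVWriter (the products.csv filename is arbitrary):

import com.github.tototoshi.csv.CSVWriter
import java.io.File

val writer = CSVWriter.open(new File("products.csv"))
// Header row, then one row per scraped product
writer.writeRow(List("name", "url", "image", "price"))
products.foreach(p => writer.writeRow(List(p.name, p.url, p.image, p.price)))
writer.close()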

Implementing Web Crawling

To scrape multiple pages, we need to crawl through pagination links.

Add Scala collections to manage pages:

import scala.collection.mutable._

// Pages still to visit, pages already seen, and the products collected so far
val pagesToScrape = Queue("https://www.scrapingcourse.com/ecommerce/page/1/")
val pagesDiscovered = Set(pagesToScrape.head)
val products = ListBuffer[Product]()

Implement the crawling logic:

while (pagesToScrape.nonEmpty) {
  val page = pagesToScrape.dequeue()
  val doc = browser.get(page)

  // Extract products from the current page
  val productElements = doc >> elementList("li.product")
  productElements.foreach { element =>
    val name = element >> text("h2")
    val url = element >> element("a") >> attr("href")
    val image = element >> element("img") >> attr("src")
    val price = element >> text("span")
    products += Product(name, url, image, price)
  }

  // Discover new pages through the pagination links
  val paginationLinks = doc >> elementList("a.page-numbers")
  paginationLinks.foreach { link =>
    val nextPage = link >> attr("href")
    if (!pagesDiscovered.contains(nextPage)) {
      pagesDiscovered += nextPage
      pagesToScrape.enqueue(nextPage)
    }
  }
}
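
When the queue empties, every discovered page has been visited. A quick sanity check on the collected results:

println(s"Scraped ${products.size} products from ${pagesDiscovered.size} pages")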

Scraping JavaScript-Rendered Pages with Selenium

For dynamic pages, you can use Selenium to control a headless browser.

Step 1: Add Selenium Dependency

Add the Selenium Java bindings to your project. Since Selenium 4.6, the bundled Selenium Manager resolves a matching ChromeDriver automatically, so you usually don't need to download the driver by hand.
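
For example, in build.sbt (the version below is illustrative; use the latest 4.x release), followed by sbt update:

libraryDependencies += "org.seleniumhq.selenium" % "selenium-java" % "4.23.0"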

Step 2: Scrape Dynamic Content

import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.chrome.ChromeOptions
import org.openqa.selenium.By
import scala.jdk.CollectionConverters._

object SeleniumScraper {
  def main(args: Array[String]): Unit = {
    // Run Chrome in headless mode
    val options = new ChromeOptions()
    options.addArguments("--headless")
    val driver = new ChromeDriver(options)

    driver.get("https://scrapingclub.com/exercise/list_infinite_scroll/")

    // Each product card on this page is a ".post" element
    val products = driver.findElements(By.cssSelector(".post")).asScala.map { element =>
      val name = element.findElement(By.cssSelector("h4")).getText
      val url = element.findElement(By.cssSelector("a")).getAttribute("href")
      val image = element.findElement(By.cssSelector("img")).getAttribute("src")
      val price = element.findElement(By.cssSelector("h5")).getText
      Product(name, url, image, price)
    }

    products.foreach(println)
    driver.quit()
  }
}
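
Note that this page loads more items as you scroll, while the code above only captures the initially rendered products. One way to reach the later items is to scroll via JavaScript before extracting; a rough sketch (the scroll count and sleep duration are arbitrary):

// ChromeDriver implements JavascriptExecutor, so executeScript is available
for (_ <- 1 to 3) {
  driver.executeScript("window.scrollTo(0, document.body.scrollHeight)")
  Thread.sleep(2000) // crude wait; prefer WebDriverWait for robustness
}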

Avoiding Anti-Bot Measures

To prevent getting blocked:

  1. Set a real-world User-Agent.
  2. Use rotating proxies with tools like Bright Data.

Both can be configured when constructing the scala-scraper browser, which accepts a user-agent string and a java.net.Proxy. A minimal sketch (the proxy host and port are placeholders):

import java.net.{InetSocketAddress, Proxy}

val proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("your-proxy-ip", 8080))
val browser = new JsoupBrowser("Mozilla/5.0", proxy)

Conclusion

Congratulations! You’ve learned how to perform web scraping in Scala, covering static and dynamic content. Following this tutorial, you now understand how to set up your environment, extract data, handle pagination, and export results to CSV. Keep experimenting to refine your skills and build more advanced scrapers!
