Selenium in Ruby: A Web Scraping Guide
If you’re looking for alternatives to building your own web scrapers, check out our Best Web Scraping Tools article, which covers top-rated tools that can simplify your data extraction projects.
Let’s get started!
Why Use Selenium for Web Scraping?
Web scraping involves extracting data from web pages automatically. While basic scraping can be done with libraries like Nokogiri, some websites use JavaScript to load content dynamically. Traditional scrapers struggle to retrieve data from such sites.
This is where Selenium helps. It allows Ruby scripts to control real browsers, interact with JavaScript-based websites, and extract the required data. You can also learn how to scrape websites with Selenium and PHP in this article.
Some advantages of Selenium for web scraping:
- Handles dynamic content: It can interact with JavaScript-based pages.
- Mimics human behavior: Selenium can click buttons, scroll, and fill forms (see the sketch after this list).
- Cross-browser support: Works with Chrome, Firefox, Edge, and other browsers.
- Flexible: Supports multiple programming languages, including Ruby.
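To illustrate the second point, here is a minimal sketch of those interactions in Ruby. The URL and the CSS selectors ("#search", "button[type='submit']") are hypothetical placeholders, not taken from a real page:
require "selenium-webdriver"

driver = Selenium::WebDriver.for :chrome
driver.navigate.to "https://example.com" # Placeholder URL

# Fill a form field
driver.find_element(:css, "#search").send_keys("ruby scraping")

# Click a button
driver.find_element(:css, "button[type='submit']").click

# Scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

driver.quit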
Setting Up Selenium in Ruby
Step 1: Install Selenium for Ruby
Install Ruby
First, check if you have Ruby installed. Open a terminal and type:
ruby -v
If Ruby is not installed, download it from ruby-lang.org.
Install Required Gems
Selenium requires the selenium-webdriver gem. Install it using:
gem install selenium-webdriver
You are now ready to start scraping. Note that recent versions of the gem (4.6 and later) bundle Selenium Manager, which downloads a matching ChromeDriver automatically; on older versions you may need to install ChromeDriver yourself.
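To confirm the gem installed correctly, you can print its version from the command line:
ruby -e 'require "selenium-webdriver"; puts Selenium::WebDriver::VERSION'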
Step 2: Create a Selenium Web Scraper
Set Up a Ruby Project
Create a new folder for your project:
mkdir selenium-ruby-scraper
cd selenium-ruby-scraper
Inside the folder, create a new Ruby file:
touch scraper.rb
Now, open scraper.rb in your preferred text editor.
Initialize Selenium WebDriver
In scraper.rb, write the following code:
require "selenium-webdriver"
# Set up browser options
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument(" - headless") # Run in headless mode (no GUI)
# Initialize WebDriver
driver = Selenium::WebDriver.for :chrome, options: options
# Open the target website
driver.navigate.to "https://scrapingclub.com/exercise/list_infinite_scroll/"
# Get and print page source
puts driver.page_source
# Close the browser
driver.quit
Save the file and run it:
ruby scraper.rb
If everything is correct, the script will output the page’s HTML in the terminal.
Step 3: Extract Data from the Webpage
Now, let’s extract specific information. The target webpage contains product listings. Each product has a name and a price.
Modify scraper.rb:
require "selenium-webdriver"
# Define Product structure
Product = Struct.new(:name, :price)
# Set up browser options
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument(" - headless")
# Initialize WebDriver
driver = Selenium::WebDriver.for :chrome, options: options
driver.navigate.to "https://scrapingclub.com/exercise/list_infinite_scroll/"
# Find all products
products = []
html_products = driver.find_elements(:css, ".post")
# Extract product details
html_products.each do |html_product|
  name = html_product.find_element(:css, "h4").text
  price = html_product.find_element(:css, "h5").text
  products << Product.new(name, price)
end
# Print extracted data
products.each { |product| puts "Name: #{product.name}, Price: #{product.price}" }
# Close the browser
driver.quit
Run the script again:
ruby scraper.rb
Now, you should see product names and prices in the terminal.
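Real pages are not always uniform: if a card is missing its name or price element, find_element raises Selenium::WebDriver::Error::NoSuchElementError and the script aborts. A more defensive sketch of the extraction loop skips incomplete cards instead:
# Skip product cards that are missing a name or price
html_products.each do |html_product|
  begin
    name = html_product.find_element(:css, "h4").text
    price = html_product.find_element(:css, "h5").text
    products << Product.new(name, price)
  rescue Selenium::WebDriver::Error::NoSuchElementError
    next # Incomplete card; skip it
  end
end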
Step 4: Handle Infinite Scrolling
Some websites use infinite scrolling, meaning content loads as the user scrolls. Selenium allows us to mimic this behavior.
Modify scraper.rb by adding the following right after driver.navigate.to and before the product extraction, so all products are loaded before you collect them:
# Scroll down to load more products
10.times do
  driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
  sleep(1) # Allow time for content to load
end
# Wait for the last product to load
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.find_element(:css, ".post:nth-child(60)") }
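The fixed 10-iteration loop works for this demo page. For sites where you don't know how much content will load, a common alternative (sketched below, assuming the page height stops growing once everything has loaded) is to scroll until document.body.scrollHeight stabilizes:
# Scroll until the page height stops growing
last_height = driver.execute_script("return document.body.scrollHeight")
loop do
  driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
  sleep(1) # Give new content time to load
  new_height = driver.execute_script("return document.body.scrollHeight")
  break if new_height == last_height
  last_height = new_height
end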
Step 5: Save Data to CSV
Extracted data is more useful if stored in a structured format like CSV.
Modify scraper.rb: put the require line at the top of the file and add the CSV block after the extraction loop:
require "csv"
# Save data to CSV
CSV.open("products.csv", "wb", write_headers: true, headers: ["Name", "Price"]) do |csv|
  products.each { |product| csv << product.to_a } # Convert each Struct to a [name, price] row
end
puts "Data saved to products.csv"
Run the script again, and you will find a products.csv file in your project folder.
Step 6: Use a Proxy for Anonymity
Some websites block scrapers by detecting multiple requests from the same IP. Using a proxy can help avoid bans.
Modify scraper.rb, adding the proxy argument to the browser options before the driver is created:
# Set up proxy
proxy = "http://72.10.160.174:22669"
options.add_argument(" - proxy-server=#{proxy}")
This will route all requests through the specified proxy. Note that Chrome's --proxy-server flag does not accept embedded credentials, so authenticated proxies require a different setup (for example, a local forwarding proxy).
Step 7: Avoid Getting Blocked
To avoid being blocked:
- Use rotating proxies (see the sketch at the end of this step).
- Set random user agents.
- Slow down requests using sleep().
- Use services like Bright Data’s Scraping Browser for bypassing anti-bot measures.
Example of setting a user agent:
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
options.add_argument(" - user-agent=#{user_agent}")
For a scraping API, use:
driver.navigate.to "https://api.brightdata.com/v1/?apikey=YOUR_API_KEY&url=https://targetsite.com"
Conclusion
Selenium is a powerful tool for automating web interactions and extracting data. However, scraping responsibly is important — respect website terms of service, avoid overloading servers, and use APIs when available.
Now that you have mastered Selenium in Ruby, try scraping different websites and experiment with advanced interactions!