How to Use Botright for Web Scraping

In this guide, I’ll give you step-by-step instructions on how to use Botright. We’ll explore its best features, how to install it, write simple code, and even tackle those tricky CAPTCHAs. Don’t worry if you’re new to web scraping — this article is easy to follow, with clear, straightforward explanations. Let’s dive in and make web scraping a breeze!

What is Botright?

Botright is a Python library that helps you scrape websites. It builds on Playwright, a popular tool for browser automation. Botright makes some changes that help you avoid detection. Many websites use CAPTCHAs and other measures to block bots. Botright can solve some CAPTCHAs using built-in solvers. It also changes browser fingerprints. This makes it look less like a bot when websites check your browser details.

Botright runs a real browser. It can launch Chromium or Firefox. Because it drives a real browser, JavaScript executes normally, so dynamic content loads just as it would for a human visitor. This is useful when websites load data as you scroll or interact with the page. Botright is asynchronous, which means you can run many scraping tasks at the same time. However, its API is not thread-safe, so you must be careful when running tasks concurrently: the safest approach is one browser instance per task.
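
To make that concrete, here is a minimal sketch of the one-browser-per-task pattern. The URLs are placeholders, and it assumes Botright's browser objects close like standard Playwright contexts:

import asyncio
import botright

async def scrape_one(client, url):
    # Give each task its own browser so no browser is shared across tasks
    browser = await client.new_browser()
    page = await browser.new_page()
    await page.goto(url)
    title = await page.title()
    await browser.close()
    return title

async def main():
    client = await botright.Botright(headless=True)
    urls = ["https://example.com/a", "https://example.com/b"]
    # Only the client is shared; every URL gets a fresh browser instance
    print(await asyncio.gather(*(scrape_one(client, u) for u in urls)))
    await client.close()

if __name__ == "__main__":
    asyncio.run(main())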

Installing and Setting Up Botright

Before you start scraping, you need to install Botright. This section explains the installation steps in simple words.

Prerequisites

Botright works best with Python versions below 3.10. If your default installation is newer, you can install an older interpreter alongside it, since multiple Python versions can coexist on one machine. Use a virtual environment to pin the Python version for this project; this keeps your projects separate and avoids dependency conflicts.
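
For example, on macOS or Linux you could create and activate an isolated environment with a specific interpreter. This is a minimal sketch that assumes a python3.9 binary is already installed and on your PATH:

python3.9 -m venv botright-env
# On Windows, run: botright-env\Scripts\activate
source botright-env/bin/activate
pip install botright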

Installation Steps

Install Botright Using pip:

Open your terminal or command prompt. Run the following command to install Botright:

pip install botright

This command will also install Playwright as a dependency.

Download Browser Binaries:

After the installation, you need to download the required browser binaries. Run this command:

playwright install

This command downloads the browsers that Botright will use.

Install a Special Browser (Optional):

For better anti-bot capabilities, you can install the Ungoogled Chromium browser. This is a privacy-focused version of Chromium. Download it from its official page and install it on your system. Botright will use it automatically if it is installed.

Set Up Your Project:

Create a new folder for your project. Inside this folder, create a new file called scraper.py. Use any code editor, such as VS Code or Sublime Text.

Once you have installed Botright and set up your project folder, you can start coding your web scraper.
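
Before writing the real scraper, it can help to run a quick smoke test that confirms Botright, Playwright, and the browser binaries are all wired up. This is a minimal sketch; the URL is just a placeholder:

import asyncio
import botright

async def main():
    client = await botright.Botright()
    browser = await client.new_browser()
    page = await browser.new_page()
    await page.goto("https://example.com")
    # Print the page title to confirm the browser rendered the page
    print(await page.title())
    await client.close()

if __name__ == "__main__":
    asyncio.run(main())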

Building a Basic Web Scraper with Botright

In this section, we will write a simple scraper. Our goal is to extract product information from a dynamic website. We will use Botright to load the page and extract the data.

Step 1: Import Libraries

Start by opening your scraper.py file. Import the necessary libraries. You will need asyncio for asynchronous programming, Botright for browser control, and csv to write data to a file. Write the following code:

import asyncio
import botright
import csv

Step 2: Create a Scraper Function

Create a function called scrape_page that takes a page instance as input. This function will find all product elements on the page. Assume that each product is contained in a <div> with a class named “product-item”. Write the following code:

async def scrape_page(page):
    # Select all elements with the product class
    products = await page.query_selector_all(".product-item")
    # Create a list to store the data
    product_list = []
    # Loop over each product element
    for product in products:
        # Get the product name, price, and image
        name_element = await product.query_selector(".product-name")
        price_element = await product.query_selector(".product-price")
        image_element = await product.query_selector("img")
        # Skip incomplete product cards so a missing element cannot crash the scraper
        if not (name_element and price_element and image_element):
            continue
        # Extract the text or attribute from each element
        name = await name_element.inner_text()
        price = await price_element.inner_text()
        image_url = await image_element.get_attribute("src")
        # Create a dictionary with the product data
        product_data = {
            "name": name,
            "price": price,
            "image": image_url,
        }
        # Add the data to the list
        product_list.append(product_data)
    # Return the complete list of products
    return product_list

Step 3: Export Data to CSV

After scraping the product data, you will want to save it. You can export the data to a CSV file. Add a function that writes the data to a file:

def save_to_csv(data):
    # Define the CSV file name
    filename = "products.csv"
    # Define the field names that match the data keys
    fieldnames = ["name", "price", "image"]
    # Open the CSV file for writing
    with open(filename, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        # Write each product dictionary as a row in the CSV
        for item in data:
            writer.writerow(item)
    print("Data saved to", filename)

Step 4: Putting It All Together

Now, you need to create a function that opens a browser, loads the page, and calls the scraping function. This function is called run_scraper:

async def run_scraper():
    # Start Botright in headless mode
    client = await botright.Botright(headless=True)
    browser = await client.new_browser()
    # Open a new page in the browser
    page = await browser.new_page()
    # Navigate to the target website
    await page.goto("https://example.com/products")
    # Call the scraper function to get product data
    products = await scrape_page(page)
    # Save the data to a CSV file
    save_to_csv(products)
    # Close the browser
    await client.close()

Finally, run the scraper using the asyncio event loop:

if __name__ == "__main__":
    asyncio.run(run_scraper())

This code shows a basic example of using Botright for web scraping. It loads a page, extracts product data, and saves it to a CSV file.

Dealing with Infinite Scrolling Pages

Many modern websites load data as you scroll down, a feature called infinite scrolling. Botright can help you handle this feature. You can program it to scroll down and load new content until the end of the page.

Step 1: Define the Scrolling Function

Create a function that scrolls the page. This function will scroll to the bottom and wait for new content to load. Here is an example:

async def auto_scroll(page):
    last_height = 0
    while True:
        # Scroll to the bottom of the page
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for new content to load
        await page.wait_for_timeout(3000)
        # Get the new height after scrolling
        new_height = await page.evaluate("document.body.scrollHeight")
        # Check if the page height has changed
        if new_height == last_height:
            # No new content is loaded; break the loop
            break
        last_height = new_height

Step 2: Integrate Scrolling With Scraping

Now, modify the run_scraper function to use the scrolling function before scraping. This ensures that all the products are loaded on the page.

async def run_scraper_infinite():
    # Start Botright in headless mode
    client = await botright.Botright(headless=True)
    browser = await client.new_browser()
    # Open a new page in the browser
    page = await browser.new_page()
    # Navigate to the infinite scrolling page
    await page.goto("https://example.com/infinite-products")
    # Scroll down to load all content
    await auto_scroll(page)
    # Scrape the loaded page content
    products = await scrape_page(page)
    # Save the scraped data to a CSV file
    save_to_csv(products)
    # Close the browser
    await client.close()

if __name__ == "__main__":
    asyncio.run(run_scraper_infinite())

In this code, the auto_scroll function ensures the page is fully loaded. Then the scraper collects all the data. This method is useful when websites do not show all products at once.
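
One caveat: on pages that genuinely never stop loading new content, the loop above never exits. A simple hedge is to cap the number of scroll rounds; the max_rounds parameter below is an illustrative addition, not a Botright feature:

async def auto_scroll_bounded(page, max_rounds=20):
    last_height = 0
    for _ in range(max_rounds):
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        await page.wait_for_timeout(3000)
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            # No new content loaded; stop early
            break
        last_height = new_height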

Exporting Data to CSV

After scraping the data, you often need to save it for further analysis. CSV (Comma Separated Values) is a standard file format. We already saw a basic function to save data to CSV. Let’s review the process.

  1. Collect the Data: Use Botright to gather the information from the web page. The data is usually stored as a list of dictionaries.
  2. Define the CSV Format: Choose the field names. In our example, we used “name”, “price”, and “image”.
  3. Write to the CSV File: Open a new CSV file. Use Python’s csv.DictWriter to write the header and rows. This makes the CSV file ready for use in other applications like Excel.

Here is the complete function once again:

def save_to_csv(data):
    filename = "products.csv"
    fieldnames = ["name", "price", "image"]
    with open(filename, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        for item in data:
            writer.writerow(item)
    print("Data saved to", filename)

You can export your scraped data quickly and easily using this simple function.
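
If you want to test the function on its own, you can feed it a hand-made record first; the values here are made up:

if __name__ == "__main__":
    sample = [{"name": "Demo Product", "price": "$9.99", "image": "https://example.com/demo.jpg"}]
    save_to_csv(sample)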

Bypassing CAPTCHAs with Botright

Many websites use CAPTCHAs to stop bots. Botright includes some features that can help solve these puzzles. This section explains how to set up Botright to handle CAPTCHAs.

Step 1: Write a Function to Solve CAPTCHAs

Create a new Python file or add a new function to your script. This function will use Botright’s built-in methods to solve a CAPTCHA. For this example, we will use Google’s reCAPTCHA demo page.

import asyncio
import botright

async def solve_captcha():
    # Start Botright in headless mode
    client = await botright.Botright(headless=True)
    browser = await client.new_browser()
    # Open a new page in the browser
    page = await browser.new_page()
    # Navigate to the reCAPTCHA demo page
    await page.goto("https://www.google.com/recaptcha/api2/demo")
    # Use Botright's built-in method to solve the CAPTCHA
    await page.solve_recaptcha()
    # Take a screenshot to check the result
    await page.screenshot(path="captcha_solved.png")
    # Print a message to show success
    print("CAPTCHA solved. Screenshot saved as captcha_solved.png.")
    # Close the browser
    await client.close()

if __name__ == "__main__":
    asyncio.run(solve_captcha())

Step 2: How It Works

In this code:

  • Headless Mode: Botright runs with headless=True, so the browser does not open a visible window. Keep in mind that some anti-bot systems detect headless browsers more easily; if solving fails, try headless=False.
  • Solving the CAPTCHA: The solve_recaptcha() method is called on the page object. It uses Botright’s built-in image-recognition models to work through the puzzle.
  • Screenshot: After the solver finishes, a screenshot is taken so you can verify that the CAPTCHA was actually solved.

Keep in mind that Botright may not solve every CAPTCHA. Success rates typically range between 50% and 80%, and more advanced CAPTCHAs may still block your attempts.
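
Because solving is hit-or-miss, a common pattern is to retry a few times before giving up. The sketch below assumes that a failed solve raises an exception; the attempt count and reload strategy are illustrative choices, not Botright features:

async def solve_with_retries(page, attempts=3):
    for attempt in range(1, attempts + 1):
        try:
            # Reload the challenge and try the solver again on each attempt
            await page.goto("https://www.google.com/recaptcha/api2/demo")
            await page.solve_recaptcha()
            print(f"Solved on attempt {attempt}")
            return True
        except Exception as error:
            print(f"Attempt {attempt} failed: {error}")
    return False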

Limitations of Botright

Botright is a powerful tool, but it has some limitations that you should know about.

  • Python Version Compatibility: Botright does not support Python 3.10 or later. To avoid conflicts, use Python 3.9 or earlier.
  • Thread Safety: The API is not thread-safe. When you run multiple scraping tasks at once, you must run a separate browser instance for each task. Otherwise, you may encounter errors or deadlocks.
  • Advanced Anti-Bot Systems: Botright can handle basic anti-bot measures. However, it struggles with advanced systems like Cloudflare, Akamai, and other high-level security services. These systems may still block your requests.
  • CAPTCHA Success Rate: The built-in CAPTCHA solvers work well for many standard CAPTCHAs. However, they may fail on more complex puzzles like Geetest. This means that your scraping task might not succeed every time.

Conclusion

So, we’ve covered everything you need to know to get started with Botright for web scraping. We explained how to install it, write code, handle infinite scrolling, export data, and solve CAPTCHAs. Botright helps you scrape websites by mimicking a human user, and we also covered its limitations. Now you have the basics to start your own scraping projects and collect data from many different websites.
