How to Use Hrequests for Web Scraping
In this guide, I’ll walk you through how to collect information from websites using Hrequests. Don’t worry if you’re new to this — I’ll keep things simple and show you step-by-step.
We’ll start by installing Hrequests, sending HTTP requests, and extracting useful data. Then, we’ll dive into handling multiple pages, dealing with dynamic content, and even using concurrency to speed things up. By the end, you’ll know how to scrape websites like a pro.
If you’re curious about web scraping and want an easy way to get started, this guide is for you. Let’s jump in!
What is Hrequests?
Hrequests stands for “human requests.” It is a Python library made for web scraping. Hrequests combines an HTTP client, an HTML parser, and a headless browser. This means you can use it to scrape both static and dynamic websites. The library is simple to use and is ideal for those who are new to web scraping.
Hrequests can make your web scraping projects easier. If a website serves its content as plain HTML, you can use the HTTP client and the parser. If a website loads its data with JavaScript, you can use the headless browser feature instead. Hrequests also offers an option to mimic human behavior in the browser, which can help you avoid bot detection on some websites.
Enhance Your Scraping Setup with Bright Data
Bright Data’s advanced proxy services provide robust, high-speed connections that seamlessly complement Hrequests. By integrating their rotating residential proxies into your web scraping workflow, you can bypass anti-bot measures and improve data retrieval without complicated configurations.
Discover more providers in my best proxy providers article.
Getting Started with Hrequests
Before you begin, you need to make sure you have Python 3 installed on your computer. You can download the latest version of Python from the official website. After installing Python, you can install Hrequests using pip. Open your terminal or command prompt and type the following command:
pip install -U "hrequests[all]"
This command will install Hrequests along with its supporting libraries. One of these libraries is Playwright, which Hrequests uses for its headless browser feature.
Next, you need to install the browser binaries that Playwright uses. Run this command:
playwright install
After you have installed Hrequests and the Playwright browser binaries, create a new folder for your project. Open your favorite code editor and create a new Python file, for example, scraper.py. You are now ready to start coding your web scraper.
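To confirm that everything installed correctly, you can try importing the library from your terminal:

python -c "import hrequests"

If this command exits without an error, your setup is ready.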
Learn more about web scraping with Playwright here.
Making Your First HTTP Request
The first step in web scraping is to send an HTTP request to a website and get its HTML content. Hrequests makes this task simple. In your scraper.py file, start by importing Hrequests:
import hrequests
Next, use the get method to send a request to the target website. For example, if you want to scrape a demo e-commerce site, you can use the following code:
response = hrequests.get("https://www.scrapingcourse.com/ecommerce/")
This line of code sends a GET request to the website. The response object holds the HTML of the page. To see the HTML, you can print the response text:
html_content = response.text
print(html_content)
When you run this code, the full HTML content of the page will be printed to your console. This is your first step toward web scraping.
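Before parsing anything, it is worth confirming that the request actually succeeded. Hrequests is designed as a drop-in replacement for the requests library, so the familiar status code check should apply (a minimal sketch):

import hrequests

response = hrequests.get("https://www.scrapingcourse.com/ecommerce/")

# Only proceed if the server returned a successful response
if response.status_code == 200:
    print(response.text)
else:
    print(f"Request failed with status {response.status_code}")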
Extracting Data from HTML
After you have the HTML content, the next step is to extract valuable data. Web pages contain many elements like paragraphs, images, and links. Hrequests has a built-in HTML parser that can help you find these elements. Let us assume you want to extract product details such as the product name, price, URL, and image source from the e-commerce page.
First, inspect the website in your browser. Right-click on a product and choose “Inspect” to see the HTML structure. You will notice that each product sits inside an <li> tag with a class name like “product”. With this information, you can use CSS selectors to find these elements.
Here is an example code snippet to extract product details:
# Obtain all the product containers
products = response.html.find_all(".product")

# Create an empty list to hold the product data
product_data = []

# Loop through each product container to extract details
for product in products:
    data = {
        "name": product.find(".woocommerce-loop-product__title").text,
        "price": product.find(".price").text.replace("\n", ""),
        "url": product.find("a").href,
        "img": product.find("img").src
    }
    product_data.append(data)

# Print the extracted product data
print(product_data)
In the code above, we first find all elements with the class .product. We then loop over these elements and extract the needed text and attributes. We use the replace method to remove unwanted newline characters from the price data.
When you run this code, you will see a list of dictionaries printed in your console. Each dictionary represents a product with its name, price, URL, and image source.
Saving Data to a CSV File
Storing the scraped data in a CSV file is a common practice. Python’s built-in CSV module makes this task straightforward. Once you have your product data in a list of dictionaries, you can write the data to a CSV file.
Here is how you can do it:
import csv

# Define the keys from the first product dictionary
keys = product_data[0].keys()

# Open a new CSV file in write mode
with open("product_data.csv", "w", newline="", encoding="utf-8") as output_file:
    dict_writer = csv.DictWriter(output_file, fieldnames=keys)
    dict_writer.writeheader()
    dict_writer.writerows(product_data)

print("CSV created successfully")
This code snippet creates a CSV file named product_data.csv and writes the header and rows. You will see the data in a table format when you open the file. This method is useful for saving and later analyzing your scraped data.
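If you already use pandas, the same list of dictionaries can be written out in a single call. This assumes pandas is installed separately; it is not an Hrequests dependency:

import pandas as pd

# Convert the list of dictionaries into a DataFrame and write it to CSV
pd.DataFrame(product_data).to_csv("product_data.csv", index=False)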
Scraping Multiple Pages
Many websites use pagination to display a large number of items. If you need to scrape data from all pages, you must navigate through each page. Let us see how you can modify your code to handle multiple pages.
Using a Loop for Pagination
Assume the website shows products on multiple pages with a “Next” button. The following code uses a while loop to check for the next page and scrape data until no more pages are available:
# Define a function to scrape the current page
def scraper(response):
    products = response.html.find_all(".product")
    product_data = []
    for product in products:
        data = {
            "name": product.find(".woocommerce-loop-product__title").text,
            "price": product.find(".price").text.replace("\n", ""),
            "url": product.find("a").href,
            "img": product.find("img").src
        }
        product_data.append(data)
    return product_data

# Start by requesting the first page
response = hrequests.get("https://www.scrapingcourse.com/ecommerce/")

# List to collect all products
all_products = []

while True:
    # Scrape the data from the current page
    all_products.extend(scraper(response))
    # Find the next page element
    next_page = response.html.find(".next")
    # If a next page is found, continue scraping
    if next_page:
        response = hrequests.get(next_page.href)
    else:
        break

print(all_products)
This code defines a function scraper to extract data from a page. Then, we loop through pages using the “Next” button. When there is no next page, the loop stops. This approach helps you collect data from all pages of the website.
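One caveat: if a site's “Next” link is relative instead of absolute, requesting next_page.href directly will fail. A minimal sketch of a fix, assuming the response exposes the page URL as response.url (as the requests-style interface does):

from urllib.parse import urljoin

# Resolve a possibly-relative "Next" link against the current page's URL
next_url = urljoin(response.url, next_page.href)
response = hrequests.get(next_url)

On this demo site the links are absolute, so the simpler version above works as-is.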
Using Concurrency for Faster Scraping
When you have many pages to scrape, doing it one by one may take a long time. Hrequests supports concurrency. This means you can request multiple pages at once. Many websites have URLs with page numbers, such as:
https://www.scrapingcourse.com/ecommerce/page/1/
https://www.scrapingcourse.com/ecommerce/page/2/
…
Learn more about concurrent scraping with Python.
Scraping Multiple Pages Concurrently
Here is an example of how to scrape multiple pages at the same time:
# Build the list of URLs for pages 1 through 12
urls = [
    f"https://www.scrapingcourse.com/ecommerce/page/{page}/"
    for page in range(1, 13)
]
# Request all URLs concurrently
responses = hrequests.get(urls)
# List to hold all product data
all_products = []
# Loop through each response to extract product data
for response in responses:
    all_products.extend(scraper(response))

print(all_products)
This method sends requests to all pages at the same time. It then loops through each response and collects the product data. Concurrency speeds up the scraping process. It is very useful when you need to scrape many pages quickly.
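If you want more control over how many requests run at once, the Hrequests README also documents a grequests-style interface built on async_get and map. A sketch, assuming that API and its size argument for capping concurrency:

import hrequests

# Build lazy requests for each page
reqs = [hrequests.async_get(url) for url in urls]

# Send them, with at most 4 requests in flight at a time
responses = hrequests.map(reqs, size=4)

for response in responses:
    all_products.extend(scraper(response))

Capping concurrency keeps you from hammering the server, which ties into the best practices covered later.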
Scraping JavaScript-Rendered Pages
Not all websites serve static HTML. Some websites load content using JavaScript. Hrequests comes with a headless browser feature. This feature can render pages that need JavaScript. One common example is infinite scrolling. In infinite scrolling, new content loads as you scroll down the page.
Using Hrequests for Dynamic Content
Let us see how to scrape a page that uses infinite scrolling. In this example, we will scrape a website that loads product items as you scroll.
First, import the Session class from Hrequests and the time module:
from hrequests import Session
import time
Setting Up a Headless Browser
Create a session with a headless browser. You can choose a browser such as Chrome:
session = Session(browser="chrome")
Now, create a page session and load the target website:
page = session.render("https://www.scrapingcourse.com/infinite-scrolling")
Scrolling the Page
The idea is to scroll down the page until no new content loads. You can use JavaScript to scroll. Use the evaluate method to run JavaScript in the browser:
# Get the current height of the page
last_height = page.evaluate("document.body.scrollHeight")

while True:
    # Scroll down the page
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    # Wait for new content to load (adjust the delay as needed)
    time.sleep(10)
    # Get the new height after scrolling
    new_height = page.evaluate("document.body.scrollHeight")
    # Stop if the page height has not increased
    if new_height == last_height:
        break
    # Update the height for the next iteration
    last_height = new_height
This loop will continue scrolling until the page no longer loads new content.
Extracting Data After Scrolling
After the page stops loading new content, you can extract the product data. Define a function to scrape the data from the dynamic page:
def scraper(page):
    products = page.html.find_all(".product-item")
    product_data = []
    for product in products:
        data = {
            "name": product.find(".product-name").text,
            "price": product.find(".product-price").text
        }
        product_data.append(data)
    print(product_data)

# Call the scraper function
scraper(page)

# Close the browser session
page.close()
This code will print the product names and prices from the page that uses infinite scrolling. The headless browser helps to load and render all the JavaScript content.
Handling Challenges and Limitations
While Hrequests is a powerful tool, it has some limitations. One limitation is its small user base, which can make it harder to find help if you run into problems. Another is that Hrequests does not support proxy authentication, a feature that many premium proxy services require.
In addition, Hrequests may not bypass advanced anti-bot systems. Websites protected by services such as Cloudflare, Akamai, or DataDome can block your scraper. For example, if you try to scrape a site with strong anti-bot protection, you may get a blocked response instead of the full HTML content.
Overcoming Anti-Bot Measures
If you run into anti-bot systems, you may need to use an alternative solution. One such solution is the Bright Data API. Bright Data handles proxy rotation, headless browsers, and CAPTCHA solving. It helps you bypass most anti-bot measures.
Here is a sketch of how you might call Bright Data's Web Unlocker API from Python. The endpoint and payload below follow the request format in Bright Data's docs, but the zone name is account-specific, so adjust it to match your dashboard:

import requests

api_key = ""  # your Bright Data API key

payload = {
    "zone": "web_unlocker1",  # replace with your own zone name
    "url": "https://www.g2.com/products/asana/reviews",
    "format": "raw",
}
headers = {"Authorization": f"Bearer {api_key}"}

response = requests.post("https://api.brightdata.com/request", json=payload, headers=headers)
print(response.text)
This code sends a request to the Bright Data API. Bright Data then handles proxy rotation and JavaScript rendering. As a result, you can access websites that would normally block your scraper.
Bright Data is a good alternative when you face heavy anti-bot protection. However, Hrequests is still a great tool for many web scraping tasks. You can choose the tool that best fits your needs.
Best Practices for Web Scraping
When you use Hrequests or any other web scraping tool, it is important to follow some best practices. Here are a few tips:
- Respect the Website’s Robots.txt: Many websites have a robots.txt file. This file tells you which parts of the website you are allowed to scrape. Always check this file before you start scraping.
- Do Not Overload the Website: Make sure you add delays between your requests. This helps you avoid sending too many requests in a short window. Overloading the server can get your IP blocked (see the sketch after this list).
- Handle Errors Gracefully: Websites can change their structure at any time. Add error handling to your code so your scraper can survive missing elements or connection issues without crashing; the same sketch below shows a basic try/except pattern.
- Use Concurrency Wisely: While concurrency can speed up your scraping, do not send too many requests at once. Respect the target website’s server load to avoid causing issues.
- Store Data Securely: When saving data to files or databases, ensure that you store it securely. Always validate and sanitize the data to avoid issues later on.
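To make the delay and error-handling tips concrete, here is a minimal sketch combining both. The delay values are illustrative; tune them to the site you are scraping:

import random
import time

import hrequests

urls = [f"https://www.scrapingcourse.com/ecommerce/page/{page}/" for page in range(1, 13)]

for url in urls:
    try:
        response = hrequests.get(url)
    except Exception as exc:  # network failures, timeouts, etc.
        print(f"Skipping {url}: {exc}")
        continue
    if response.status_code != 200:  # blocked, rate-limited, or missing page
        print(f"Got status {response.status_code} for {url}")
        continue
    # ... parse the page here, for example with the scraper() function from earlier ...
    time.sleep(random.uniform(1, 3))  # polite delay between requests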
Conclusion
Hrequests is a powerful and easy-to-use library for web scraping in Python. Its ability to handle both static and dynamic content makes it a great choice for many projects. Whether you are scraping a simple e-commerce site or a complex website with infinite scrolling, Hrequests can help you get the job done. Use the examples in this guide to start your own projects. With practice and careful planning, you will be able to build efficient web scrapers that meet your needs.