Web Scraping: Intercepting XHR Requests

In this article, I’m going to show you how to intercept the XHR requests a page makes in the background and grab the data you need. We’ll dive into why this approach works so well and how you can use Python and Playwright to make it happen. Whether you’re new to web scraping or looking for more advanced techniques, this guide has you covered. Let’s get started!

What is XHR?

XMLHttpRequest (XHR) is a browser feature that allows web pages to fetch data asynchronously without reloading the page. This lets websites update content dynamically, making them more interactive. For example, when you scroll through your Twitter feed or refresh your news feed, the page doesn’t reload; new content is fetched in the background using XHR.

This is useful for web scraping because rather than scraping data directly from the HTML of a page (which may be incomplete or outdated), we can intercept the requests made by the page to get the raw data that’s being fetched in the background.

Why Use XHR Requests for Web Scraping?

  1. Efficiency: Websites that load data dynamically via XHR typically send structured data in JSON or XML format. These formats are much easier to work with than HTML, which may require complex parsing.
  2. Fewer Changes: Unlike HTML elements and CSS selectors, which may change frequently due to redesigns or updates, the API endpoints used for XHR requests are usually more stable over time.
  3. Avoiding HTML Parsing: Since the data is already structured, there’s no need to parse through the HTML content and use complex CSS selectors to extract the information you need. This is especially helpful for websites with complex or nested HTML structures.
  4. Reduced Bandwidth: By focusing only on the relevant XHR requests (e.g., JSON endpoints), you can avoid loading unnecessary resources like images, stylesheets, and ads, which saves bandwidth and speeds up the scraping process.
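To see why structured responses beat HTML parsing, here is a minimal sketch that pulls fields out of a JSON payload. The payload and its field names are invented for illustration; real endpoints will differ:

```python
import json

# A hypothetical JSON payload like the ones XHR endpoints return
payload = '{"assets": [{"address": "123 Main St", "price": 250000}]}'

data = json.loads(payload)
for asset in data["assets"]:
    # Fields come out already named and typed -- no CSS selectors needed
    print(asset["address"], asset["price"])
```

Compare this with locating the same values in rendered HTML, where you would need selectors that break whenever the page layout changes.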

How to Intercept XHR Requests

To scrape XHR requests, you’ll need a tool that can interact with web pages, execute JavaScript, and inspect network traffic. Two popular tools for this are Puppeteer and Playwright. In this guide, we will use Playwright with Python to intercept XHR requests.

Prerequisites

Before getting started, make sure you have the following:

  1. Python 3.x installed on your computer.
  2. Playwright installed, which will allow you to interact with web browsers.
  3. Browser binaries (Chromium, Firefox, or WebKit) for Playwright.

You can install Playwright by running the following commands in your terminal:

pip install playwright
playwright install

After installing Playwright, you’re ready to start.

Example 1: Scraping Auction.com

Let’s start by scraping auction.com, which displays auction listings of properties. The website loads its content dynamically using XHR requests. Instead of scraping the HTML directly, we will intercept the XHR requests to get the data we need.

from playwright.sync_api import sync_playwright

url = "https://www.auction.com/residential/ca/"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Intercept XHR responses and print the auction data payload
    def handle_response(response):
        if "v1/search/assets?" in response.url:
            print(response.json()["result"]["assets"]["asset"])

    page.on("response", handle_response)

    # Go to the page and wait for network activity to settle
    page.goto(url, wait_until="networkidle", timeout=90000)
    page.context.close()
    browser.close()

Breakdown of the Code:

  • sync_playwright(): Starts a Playwright session; p.chromium.launch() then launches a Chromium browser (headless by default).
  • browser.new_page(): Opens a new page (or tab) in the browser.
  • page.on("response", handle_response): Registers a listener that fires for every network response.
  • response.url: The URL of the intercepted response, used to filter for the request that contains the data we need.
  • response.json(): Parses the JSON body of the XHR response so it can be printed.

In this example, the relevant data is fetched from an endpoint that contains v1/search/assets? in its URL. By filtering on this string, we capture the JSON response that contains auction data, such as the property address, price, and more.
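One practical refinement, sketched here rather than taken from the original script: factor the URL check into a small helper and collect the matching payloads into a list so callers can save or post-process them. The endpoint substring and response shape are taken from the example above:

```python
ASSET_ENDPOINT = "v1/search/assets?"  # substring seen in the target XHR URLs

def is_asset_request(url: str) -> bool:
    # Match the search endpoint regardless of its query parameters
    return ASSET_ENDPOINT in url

def scrape_assets(url: str) -> list:
    """Collect the matching JSON payloads instead of printing them."""
    # Imported here so the URL filter above stays usable without a browser install
    from playwright.sync_api import sync_playwright

    results = []

    def handle_response(response):
        if is_asset_request(response.url):
            results.append(response.json()["result"]["assets"]["asset"])

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", handle_response)
        page.goto(url, wait_until="networkidle", timeout=90000)
        browser.close()
    return results
```

Keeping the filter as a pure function also makes it easy to unit-test without launching a browser.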

Example 2: Scraping Twitter.com

Twitter (now X) also uses XHR requests to load tweets, followers, and other dynamic data. We can use Playwright to intercept these requests and extract the JSON data, which is more structured and easier to parse than HTML.

import json
from playwright.sync_api import sync_playwright

url = "https://twitter.com/playwrightweb/status/1396888644019884033"

with sync_playwright() as p:
    # Print the raw JSON returned by the tweet-detail endpoint
    def handle_response(response):
        if "/TweetDetail?" in response.url:
            print(json.dumps(response.json()))

    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", handle_response)
    page.goto(url, wait_until="networkidle")
    page.context.close()
    browser.close()

In this case, the endpoint we are interested in is “/TweetDetail?”. Once the XHR request is intercepted, we can extract detailed information about the tweet, such as the tweet text, user information, retweets, and likes.
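GraphQL responses like the one from TweetDetail are deeply nested, and the exact structure changes over time, so hard-coding a path to a field is fragile. A generic recursive search is one way around this; the field name "full_text" and the sample structure below are assumptions for illustration:

```python
def find_key(obj, key):
    """Recursively yield every value stored under `key` in nested dicts/lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)

# Example with a made-up, heavily simplified structure
data = {"data": {"tweet": {"legacy": {"full_text": "Hello from Playwright!"}}}}
print(next(find_key(data, "full_text")))  # -> Hello from Playwright!
```

This way the scraper keeps working even if intermediate keys in the response are renamed or reordered.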

Example 3: Scraping NSE India

The National Stock Exchange of India (NSE) also uses XHR requests to load real-time stock market data. Instead of scraping the page directly, we can intercept these requests to get the data more efficiently.

from playwright.sync_api import sync_playwright

url = "https://www.nseindia.com/market-data/live-equity-market"

with sync_playwright() as p:
    # Print each stock symbol with its last traded price
    def handle_response(response):
        if "equity-stockIndices?" in response.url:
            for item in response.json()["data"]:
                print(item["symbol"], item["lastPrice"])

    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", handle_response)
    page.goto(url, wait_until="networkidle")
    page.context.close()
    browser.close()

Here, the data we want is available in the JSON response from the equity-stockIndices? endpoint. The code loops through the items and prints out the stock symbols and their last prices.
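The payload handling can also be factored into a pure function that is easy to test. The sample below mirrors the data/symbol/lastPrice fields used in the example; the values themselves are made up:

```python
def extract_quotes(payload: dict) -> dict:
    # Map each symbol to its last traded price from the equity-stockIndices payload
    return {row["symbol"]: row["lastPrice"] for row in payload.get("data", [])}

# Sample payload mirroring the fields used in the example above
sample = {"data": [{"symbol": "RELIANCE", "lastPrice": 2900.5},
                   {"symbol": "TCS", "lastPrice": 3850.0}]}
print(extract_quotes(sample))  # {'RELIANCE': 2900.5, 'TCS': 3850.0}
```

Returning a dict keyed by symbol makes it straightforward to diff successive snapshots of the market data.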

Avoid Blocks When Scraping XHR Requests with Bright Data’s Web Unlocker

Some websites implement anti-bot protections that can block your scraper from accessing XHR data. These protections include CAPTCHAs, IP bans, and JavaScript challenges, making it difficult to extract data reliably. Instead of manually handling these issues, you can use Bright Data’s Web Unlocker, which automatically bypasses these restrictions and ensures a high success rate for scraping XHR requests.

Why Use Web Unlocker for XHR Scraping?

✅ Bypasses CAPTCHAs & Bot Protections — No manual solving required
✅ Handles Headers, Cookies & JavaScript Rendering — Ensures smooth scraping
✅ No Need for Manual Proxy Rotation — Works seamlessly with any target website

Example: Using Web Unlocker with Python’s requests Library

import requests

# Placeholder credentials -- replace with your Web Unlocker proxy details
proxy = "http://USERNAME:PASSWORD@PROXY_HOST:22225"
url = "https://example.com"

response = requests.get(url, proxies={"http": proxy, "https": proxy})
print(response.text)

🔹 How It Works:

  • Routes requests through Bright Data’s Web Unlocker, bypassing anti-bot protections
  • Automatically manages headers, cookies, and JavaScript challenges
  • Ensures consistent access to XHR data without IP bans
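Since the rest of this guide uses Playwright, it is worth noting that Playwright can also route browser traffic through a proxy at launch time. The sketch below uses placeholder credentials; substitute your own proxy zone details:

```python
# Placeholder proxy settings -- replace host and credentials with your own
PROXY = {
    "server": "http://PROXY_HOST:22225",
    "username": "USERNAME",
    "password": "PASSWORD",
}

def launch_with_proxy(playwright):
    # Playwright accepts proxy settings when the browser is launched,
    # so every request the page makes is routed through the proxy
    return playwright.chromium.launch(proxy=PROXY)

# Usage (inside a sync_playwright() block):
#   with sync_playwright() as p:
#       browser = launch_with_proxy(p)
```

This keeps the XHR-interception code from the earlier examples unchanged; only the launch call differs.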

Best Practices for Scraping XHR Requests

  1. Handle Dynamic Content: When scraping dynamic websites that use XHR, it’s essential to understand the timing of the page load. Use wait_until="networkidle" to ensure the page is fully loaded before scraping, or use wait_for_selector to wait for a specific element to appear.
  2. Monitor API Endpoints: Use your browser’s Developer Tools to monitor the network traffic and identify which XHR requests contain the data you need. Look for requests that return data in JSON or XML format, as these are typically easier to parse.
  3. Respect Robots.txt: Before scraping any website, make sure to check its robots.txt file to ensure that scraping is allowed. Respect the website’s rules and avoid overloading its servers with too many requests.
  4. Use Proxies and Avoid IP Blocks: Some websites may block your IP if you make too many requests in a short amount of time. To avoid this, use proxies to distribute your requests and avoid detection.
  5. Error Handling: Handle potential errors in your code, such as timeouts or missing data, to make your scraper more resilient. You can implement retries or use fallback mechanisms to ensure your scraper continues working smoothly.

Conclusion

Intercepting XHR requests for web scraping is an effective and efficient way to collect structured data from dynamic websites. By focusing on API endpoints rather than scraping HTML, you can save time, bandwidth, and effort in your scraping tasks. Playwright is a powerful tool that simplifies this process, and with the examples provided in this article, you should have a solid foundation to get started with scraping XHR requests for your projects.

Remember always to follow best practices, respect the websites you’re scraping, and handle potential issues like IP blocks and errors gracefully. With the right approach, XHR scraping can be a powerful tool in your web scraping toolkit!
