Web Scraping With PHP: Step-By-Step Tutorial
In this guide, I’ll walk you through creating a simple web scraper with PHP. Don’t worry if you’re new to the process; I’ll take it one step at a time. By the end, you’ll know how to pull data from websites and store it in files for future use. Let’s dive in!
What is Web Scraping?
Web scraping is the process of fetching data from a website by extracting useful information from its HTML structure. It allows developers to access data from sources that do not provide APIs or where APIs are insufficient for specific use cases.
For example, you may want to scrape e-commerce websites to extract product prices, descriptions, or images. This data can then be used for comparison, analysis, or just stored for later use.
Why PHP for Web Scraping?
While Python and JavaScript have many scraping libraries like Scrapy, BeautifulSoup, and Puppeteer, PHP also excels at web scraping for several reasons:
- Simple Setup: PHP is easy to set up and doesn’t require additional installations like some other programming languages.
- Built-in Tools: PHP offers cURL, a built-in library for making HTTP requests, which is essential for web scraping.
- HTML Parsing: PHP offers libraries like Simple HTML DOM Parser, which make it easy to extract and manipulate HTML data.
- Wide Availability: PHP is commonly used on many hosting platforms, making it a convenient choice for web scraping tasks without needing special permissions or server configurations.
Read my article about the best PHP web scraping libraries.
Leveraging Advanced Proxies
Sometimes web scraping can hit roadblocks like IP blocking or CAPTCHAs. When that happens, integrating an advanced proxy service can keep your scraper running smoothly. Bright Data’s proxy solution offers automated IP rotation and smart features to help bypass anti-bot measures. This means you can focus on extracting and analyzing data with PHP, without getting stuck on connectivity issues.
Check out my list of the best proxy providers for web scraping.
Let’s now walk through how to set up your first web scraper using PHP.
Step 1: Setting Up Your PHP Environment
To get started with web scraping in PHP, you’ll need the following:
- PHP: Install the latest version of PHP (8.3 or above) from the official website or via a package manager.
- IDE: Use any IDE you like, but we recommend Visual Studio Code for its ease of use and rich extension support.
- Libraries: You will need the Simple HTML DOM Parser to parse HTML content and cURL (bundled with most PHP installations) to make HTTP requests.
Install PHP and Required Libraries
- Download and install PHP from PHP’s official website.
- For Windows, you can use tools like XAMPP or WAMP, which provide an easy-to-use environment for PHP development.
- Download the Simple HTML DOM Parser from SourceForge and add it to your project folder.
Once you have PHP and the required libraries set up, your project directory should look something like this:
/web-scraping-project
├── scraper.php
└── simple_html_dom.php
Step 2: Make HTTP Requests with cURL
To scrape data from a website, you first need to send an HTTP request to the website’s server. PHP’s cURL library is perfect for this. Here’s how to use cURL to fetch the HTML content of a webpage.
Basic cURL Request Example
<?php
// URL of the website you want to scrape
$url = "https://example.com";
// Initialize cURL session
$curl = curl_init();
// Set cURL options
curl_setopt($curl, CURLOPT_URL, $url); // Specify the URL
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // Return response as a string
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // Follow redirects
// Execute the cURL session
$htmlContent = curl_exec($curl);
// Check for errors
if ($htmlContent === false) {
    echo "Error: " . curl_error($curl);
    curl_close($curl);
    exit;
}
// Close the cURL session
curl_close($curl);
// Output the fetched HTML content
echo $htmlContent;
?>
This code sends a request to https://example.com and prints the raw HTML response. This is the first step of web scraping: fetching the data you need.
Step 3: Parsing HTML Content
Once you’ve fetched the raw HTML, the next step is to parse it and extract the information you need. PHP’s bundled DOM extension can parse HTML, but the Simple HTML DOM Parser offers a friendlier, jQuery-like API, so we’ll use it here.
Let’s use this library to extract data from the webpage. Start by including the library in your script and then use its functions to navigate the HTML structure.
Extracting Data Using Simple HTML DOM Parser
<?php
// Include the Simple HTML DOM Parser library
include_once('simple_html_dom.php');
// Fetch the HTML content (file_get_contents is used here for brevity; you can swap in the cURL routine from Step 2)
$htmlContent = file_get_contents('https://example.com');
// Create a Simple HTML DOM object
$html = str_get_html($htmlContent);
// Extract data using CSS selectors
$title = $html->find('title', 0)->plaintext; // Extract the title of the page
$heading = $html->find('h1', 0)->plaintext; // Extract the first h1 tag
// Output the extracted data
echo "Page Title: $title\n";
echo "Heading: $heading\n";
?>
In this example, we fetch the title and first heading (h1) from the webpage. The find method in Simple HTML DOM Parser works similarly to jQuery, where you can pass CSS selectors to extract specific elements.
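If you’d rather not add a third-party dependency, PHP’s bundled DOM extension can do the same job. Here’s a minimal sketch using DOMDocument and DOMXPath to pull out the title and first h1; the inline HTML string is just a stand-in for a page you fetched with cURL:

```php
<?php
// Stand-in for HTML fetched with cURL
$htmlContent = '<html><head><title>Example Domain</title></head>'
             . '<body><h1>Welcome</h1></body></html>';

// Parse with PHP's bundled DOM extension (no external library needed)
$dom = new DOMDocument();
libxml_use_internal_errors(true); // Silence warnings on imperfect real-world HTML
$dom->loadHTML($htmlContent);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$title = $xpath->query('//title')->item(0)->textContent;
$heading = $xpath->query('//h1')->item(0)->textContent;

echo "Page Title: $title\n";
echo "Heading: $heading\n";
```

The XPath queries here play the role of the `find` calls above; the trade-off is a slightly more verbose API in exchange for zero external files.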
Step 4: Scraping Multiple Items
Most of the time, you’ll want to extract more than just a single piece of data. Let’s extend the previous example to scrape multiple items. We’ll scrape product data (name, price, and image URL) from an e-commerce site.
Scraping Multiple Products
<?php
// Include the Simple HTML DOM Parser
include_once('simple_html_dom.php');
// URL of the e-commerce site
$url = 'https://example.com/products';
// Fetch the HTML content (file_get_contents is used here for brevity; you can swap in the cURL routine from Step 2)
$htmlContent = file_get_contents($url);
// Create a Simple HTML DOM object
$html = str_get_html($htmlContent);
// Find all product containers
$products = $html->find('.product'); // Use the appropriate CSS selector for product items
// Loop through each product and extract data
foreach ($products as $product) {
    $name = $product->find('.product-title', 0)->plaintext;
    $price = $product->find('.product-price', 0)->plaintext;
    $image = $product->find('img', 0)->src;

    // Output the extracted product information
    echo "Product: $name\n";
    echo "Price: $price\n";
    echo "Image URL: $image\n";
    echo "------------------------\n";
}
?>
Here, we use the .product selector to find each product on the page, and then extract the name, price, and image URL for each product. You can adjust the CSS selectors based on the actual structure of the website you’re scraping.
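One practical wrinkle: scraped prices arrive as display strings like "$10.00" or "$1,299.99", which you can’t compare or sort numerically. A small normalizer helps; `parsePrice` is a hypothetical helper name, and note it assumes US-style formatting (dot as decimal separator):

```php
<?php
// Convert a scraped price string (e.g. "$1,299.99") to a float.
// Assumes US-style formatting; European "1.299,99" would need extra handling.
function parsePrice(string $raw): float {
    // Strip currency symbols, thousands separators, and whitespace
    $clean = preg_replace('/[^0-9.]/', '', $raw);
    return (float) $clean;
}

echo parsePrice('$10.00') . "\n";    // 10
echo parsePrice('$1,299.99') . "\n"; // 1299.99
```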
Step 5: Storing Data in a CSV File
Once you’ve extracted the data, you’ll likely want to save it for later use. A common way to do this is to export the data to a CSV file. PHP has built-in functions for working with CSV files, making this easy.
Exporting Data to CSV
<?php
// Array of scraped product data
$productData = [
    ['Name' => 'Product 1', 'Price' => '$10.00', 'Image URL' => 'https://example.com/image1.jpg'],
    ['Name' => 'Product 2', 'Price' => '$20.00', 'Image URL' => 'https://example.com/image2.jpg'],
];
// Specify the path to the CSV file
$csvFilePath = 'products.csv';
// Open the CSV file for writing
$file = fopen($csvFilePath, 'w');
// Write the column headers
fputcsv($file, array_keys($productData[0]));
// Write each product's data to the CSV file
foreach ($productData as $product) {
    fputcsv($file, $product);
}
// Close the CSV file
fclose($file);
// Output a success message
echo "CSV file created successfully: $csvFilePath\n";
?>
This code takes the product data from an array and writes it to a CSV file. The fputcsv function is used to write each row of data to the file. After running the script, you’ll have a products.csv file containing the scraped information.
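To sanity-check the export, you can read the file back with fgetcsv, PHP’s counterpart to fputcsv. This sketch writes a small sample first so it’s self-contained:

```php
<?php
$csvFilePath = 'products.csv';

// Write a small sample so this sketch is self-contained
$rows = [
    ['Name', 'Price', 'Image URL'],
    ['Product 1', '$10.00', 'https://example.com/image1.jpg'],
];
$out = fopen($csvFilePath, 'w');
foreach ($rows as $row) {
    fputcsv($out, $row);
}
fclose($out);

// Read the CSV back, using the header row as array keys
$in = fopen($csvFilePath, 'r');
$headers = fgetcsv($in);
$records = [];
while (($row = fgetcsv($in)) !== false) {
    $records[] = array_combine($headers, $row);
}
fclose($in);

print_r($records);
```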
Step 6: Handling Errors and Improving the Scraper
While the examples above cover the basics of web scraping, real-world scraping often involves more complex tasks, such as handling errors, dealing with CAPTCHAs, respecting robots.txt, and scraping large volumes of data.
Here are some additional tips:
- Error Handling: Always check for errors in your cURL requests and handle them appropriately.
- Respect Robots.txt: Websites may have rules in their robots.txt file to prevent scraping. Always check and comply with these rules.
- Rate Limiting: Don’t send too many requests in a short period, or you risk being blocked. You can add delays between requests using sleep() in PHP.
- User-Agent: Some websites block requests from non-browser clients. Set a User-Agent header to mimic a web browser.
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36');
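A fixed sleep() between requests already helps, but perfectly evenly spaced requests are themselves a bot signature. Adding a little random jitter to the delay looks more natural. A small sketch (`jitteredDelay` is a hypothetical helper name):

```php
<?php
// Return a delay in seconds: a base wait plus random jitter
function jitteredDelay(float $base, float $jitter): float {
    return $base + mt_rand(0, 1000) / 1000 * $jitter;
}

$urls = ['https://example.com/page/1', 'https://example.com/page/2'];
foreach ($urls as $url) {
    // The cURL fetch routine from Step 2 would run here
    $delay = jitteredDelay(2.0, 1.5); // Between 2.0 and 3.5 seconds
    echo "Waiting " . round($delay, 2) . "s before the next request\n";
    usleep((int) ($delay * 1000000)); // usleep takes microseconds
}
```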
Advanced Web Scraping Techniques with PHP
Scraping large volumes of data from websites requires more than basic scraping techniques. To scrape data efficiently and at scale, you’ll need to implement advanced methods, including web crawling, handling dynamic content, and circumventing anti-bot protections. Let’s explore how to tackle each of these challenges with PHP.
1. Web Crawling with PHP
Web crawling refers to the process of systematically navigating through all the pages of a website and extracting the required data from each page. It’s essential for scraping paginated websites, where content is spread across multiple pages.
The scraper you created in previous sections extracts data from a single page. Now, let’s enhance it by crawling through all the pages on a website and scraping the data from each one.
Step 1: Inspect the Pagination Mechanism
Before starting, identify how the website handles pagination. In most cases, pagination links are provided by an HTML element such as a “Next” button or a series of page numbers. By inspecting the page’s HTML, you can locate the link that points to the next page and use it to crawl further.
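One wrinkle to watch for: the href you extract may be relative (e.g. /products?page=2) rather than absolute. A small resolver built on parse_url handles the common cases before you fetch the next page; `resolveUrl` is a hypothetical helper name, and a full RFC 3986 resolver would also handle "../" segments:

```php
<?php
// Resolve a possibly-relative link against the page it was found on.
// Covers the common cases: absolute, protocol-relative, root-relative,
// and directory-relative URLs.
function resolveUrl(string $base, string $href): string {
    if (preg_match('#^https?://#i', $href)) {
        return $href; // Already absolute
    }
    $parts = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (str_starts_with($href, '//')) {
        return $parts['scheme'] . ':' . $href; // Protocol-relative
    }
    if (str_starts_with($href, '/')) {
        return $origin . $href; // Root-relative
    }
    // Relative to the current page's directory
    $path = isset($parts['path']) ? preg_replace('#/[^/]*$#', '/', $parts['path']) : '/';
    return $origin . $path . $href;
}

echo resolveUrl('https://example.com/products', '/products?page=2') . "\n";
// https://example.com/products?page=2
```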
Step 2: Implementing Web Crawling
Let’s modify the previous scraper to crawl through paginated content.
<?php
include_once("simple_html_dom.php");

$url = "https://scrapingcourse.com/ecommerce/";
$productData = array();

function scraper($url) {
    global $productData;

    echo "Scraping page: $url\n";

    // Initialize cURL session
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // Convenient for local testing; keep verification on in production

    $htmlContent = curl_exec($curl);
    if ($htmlContent === false) {
        echo "cURL error: " . curl_error($curl);
        curl_close($curl);
        exit;
    }
    curl_close($curl);

    // Parse the HTML content
    $html = str_get_html($htmlContent);
    $products = $html->find(".product");

    foreach ($products as $product) {
        $name = $product->find(".woocommerce-loop-product__title", 0);
        $image = $product->find("img", 0);
        $price = $product->find("span.price", 0);

        if ($name && $price && $image) {
            $productData[] = array(
                "Name" => $name->plaintext,
                "Price" => html_entity_decode($price->plaintext),
                "Image URL" => $image->src
            );
        }
    }

    // Look for the "Next" page link
    $nextPageLink = $html->find("a.next", 0);
    if ($nextPageLink) {
        scraper($nextPageLink->href); // Recursively scrape the next page
    }
}

// Start scraping from the first page
scraper($url);

// Output the collected data
print_r($productData);
?>
This script uses cURL to fetch HTML content, parses it with Simple HTML DOM Parser, and looks for product data (name, price, and image URL). It then looks for a “Next” page link, recursively scraping all pages until there are no more links.
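One caveat: if a site’s pagination ever links back to a page you’ve already scraped, the recursive approach loops forever. Keeping a set of visited URLs guards against that. A minimal self-contained sketch (`shouldVisit` is a hypothetical helper name, and the URL list simulates a pagination loop):

```php
<?php
// Track visited URLs so the crawler never processes a page twice
$visited = [];

function shouldVisit(string $url): bool {
    global $visited;
    if (isset($visited[$url])) {
        return false; // Already scraped this page
    }
    $visited[$url] = true;
    return true;
}

// Simulated crawl order where pagination loops back to page 1
$queue = ['https://example.com/page/1', 'https://example.com/page/2', 'https://example.com/page/1'];
$scraped = [];
foreach ($queue as $url) {
    if (shouldVisit($url)) {
        $scraped[] = $url; // scraper($url) would run here
    }
}

print_r($scraped); // page/1 and page/2 only; the repeat is skipped
```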
2. Handling Dynamic Content in PHP
Some websites use JavaScript to load content dynamically after the initial page load. Simple HTML DOM Parser won’t work for such websites, since it only processes the static HTML. For dynamic content, we need a browser automation tool like Selenium, which drives a real (headless) browser that can render JavaScript.
Step 1: Using Selenium with PHP
Selenium allows you to simulate a real user by controlling a browser programmatically. For PHP, you can use the php-webdriver package to interact with a Selenium WebDriver.
To set up Selenium in PHP, follow these steps:
Install Composer (if not already installed).
Install Selenium WebDriver for PHP using Composer:
composer require php-webdriver/webdriver
composer require symfony/css-selector
Download Selenium Server and start it locally:
java -jar selenium-server-<version>.jar standalone --selenium-manager true
Step 2: Implementing Dynamic Scraping with Selenium
Here’s how you can use Selenium to scrape data from a page that loads dynamically:
<?php
namespace Facebook\WebDriver;

use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Chrome\ChromeOptions;

require_once("vendor/autoload.php");

$host = "http://localhost:4444/"; // Local Selenium server

$capabilities = DesiredCapabilities::chrome();
$chromeOptions = new ChromeOptions();
$chromeOptions->addArguments(["--headless"]); // Run Chrome in headless mode
$capabilities->setCapability(ChromeOptions::CAPABILITY_W3C, $chromeOptions);

$driver = RemoteWebDriver::create($host, $capabilities);

function scraper($driver) {
    $driver->manage()->window()->maximize();
    $products = $driver->findElements(WebDriverBy::cssSelector(".product-container"));

    foreach ($products as $product) {
        $name = $product->findElement(WebDriverBy::cssSelector(".product-name"))->getText();
        $price = $product->findElement(WebDriverBy::cssSelector(".product-price"))->getText();
        $image_url = $product->findElement(WebDriverBy::cssSelector("img"))->getAttribute("src");

        echo "Name: $name\n";
        echo "Price: $price\n";
        echo "Image URL: $image_url\n";
    }
}

// Open the target page
$driver->get("https://www.scrapingcourse.com/infinite-scrolling");

// Scroll until the page height stops growing, then scrape
$lastHeight = $driver->executeScript("return document.body.scrollHeight");
while (true) {
    $driver->executeScript("window.scrollTo(0, document.body.scrollHeight);");
    sleep(2); // Give dynamically loaded content time to render
    $newHeight = $driver->executeScript("return document.body.scrollHeight");
    if ($newHeight == $lastHeight) {
        scraper($driver);
        break;
    }
    $lastHeight = $newHeight;
}

// Close the browser
$driver->quit();
?>
In this code, we use Selenium to open a webpage, scroll it, and extract data dynamically rendered by JavaScript. After scrolling to the bottom of the page and ensuring all content has loaded, the scraper extracts the product name, price, and image URL.
3. Avoiding Blocks When Scraping with PHP
Many websites employ anti-bot measures like IP blocking, CAPTCHA, or rate-limiting to prevent scraping. To bypass these, we can use techniques like IP rotation and mimicking legitimate user behavior.
Step 1: Use Proxies to Avoid Detection
One effective way to avoid getting blocked is to use proxies. By rotating proxies with each request, you can avoid triggering anti-bot systems that flag multiple requests from the same IP.
You can use services like Bright Data, which offer automated proxy rotation and bypass anti-bot systems like CAPTCHA and JavaScript challenges.
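With plain cURL, routing a request through a proxy takes just one extra option. Here’s a sketch; the proxy host, port, and credentials below are placeholders you’d replace with your provider’s values, and `createProxiedCurl` is a hypothetical helper name:

```php
<?php
// Build a cURL handle that routes through a proxy.
// The proxy address and credentials here are placeholders.
function createProxiedCurl(string $url, string $proxy, ?string $auth = null) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_PROXY, $proxy); // e.g. "http://proxy.example.com:8080"
    if ($auth !== null) {
        curl_setopt($curl, CURLOPT_PROXYUSERPWD, $auth); // "username:password"
    }
    return $curl;
}

$curl = createProxiedCurl('https://example.com', 'http://proxy.example.com:8080', 'user:pass');
// $htmlContent = curl_exec($curl); // Executes the request through the proxy
// curl_close($curl);
```

To rotate IPs, you’d call the helper with a different proxy address per request; rotating-proxy services typically do this for you behind a single endpoint.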
Step 2: Mimicking a Legitimate User
Another way to avoid detection is to mimic the headers of a real browser. For example, you can set the User-Agent header to impersonate a real browser, like Chrome or Firefox.
Here’s an example of how to add a custom User-Agent header to your scraper:
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "https://www.example.com");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
// Set a custom User-Agent to mimic a real browser
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36");
$htmlContent = curl_exec($curl);
curl_close($curl);
echo $htmlContent;
?>
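A single hard-coded User-Agent is still a pattern anti-bot systems can latch onto. Rotating through a small pool of real browser strings varies the fingerprint between requests. A sketch:

```php
<?php
// A small pool of real browser User-Agent strings
$userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
];

// Pick a random User-Agent for each request
function randomUserAgent(array $pool): string {
    return $pool[array_rand($pool)];
}

$ua = randomUserAgent($userAgents);
echo "Using: $ua\n";
// curl_setopt($curl, CURLOPT_USERAGENT, $ua); // Apply it to your cURL handle
```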
Step 3: Using a Web Scraping API
For more advanced anti-bot protection, such as Cloudflare or Akamai, you might need to use a dedicated web scraping API like Bright Data. It automatically handles proxy rotation, CAPTCHA bypassing, and JavaScript rendering, making scraping even the most challenging websites easier.
<?php
$apiUrl = "https://api.brightdata.com/v1/?apikey=<YOUR_BRIGHT_DATA_API_KEY>&url=" . urlencode("https://www.example.com");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $apiUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
echo $response;
?>
Bright Data automates the entire scraping process, allowing you to focus on data extraction rather than dealing with blockers.
Conclusion
Web scraping with PHP is a powerful way to extract data from websites, and the combination of cURL and Simple HTML DOM Parser makes the process straightforward and effective. Whether you’re scraping product data, news articles, or social media feeds, PHP can handle the task with ease.
When scraping websites, be mindful of legal and ethical concerns, including respecting the site’s robots.txt file and terms of service. With great power comes great responsibility!