How to Scrape Google Flights with Python: A Step-by-Step Guide

Here, I’ll show you how to build a Google Flights scraper using Python and the Playwright library. We’ll go step-by-step, from setting up the environment to pulling and saving data. I’ll also share tips for handling common scraping issues, like IP bans and CAPTCHAs, since these challenges often pop up on heavily trafficked sites like Google Flights. Let’s dive in!

Why Scrape Google Flights?

Scraping Google Flights provides access to a wealth of flight-related data, which can be used in several ways:

  • Tracking Price Changes: Monitor how flight prices fluctuate over time to identify the best times to book.
  • Comparing Flight Options: Find flights that meet specific requirements, such as nonstop options, shorter layovers, or budget-friendly alternatives.
  • Market Analysis: For businesses, tracking data across airlines and routes can offer insights into pricing strategies and market trends.
  • Environmental Impact: Extract CO2 emissions data to assess the environmental impact of different flight options.

Best Tools for Automated Google Flights Data

Here is a list of tools that can automate the data collection process and save significant time and resources:

  1. Flight Analysis by kcelebi
  2. Google Flights Scraper by Bright Data
  3. SerpApi

Let’s start by setting up the environment to build our custom Google Flights scraper.

Step 1: Setting Up Your Python Environment

Before diving into coding, make sure you have a clean Python environment. It’s best to use a virtual environment to keep dependencies isolated.

Create and Activate a Virtual Environment

Open your terminal and execute the following commands:

# Create a virtual environment
python -m venv flights-scraper-env
# Activate the virtual environment
# On Windows:
.\flights-scraper-env\Scripts\activate
# On macOS/Linux:
source flights-scraper-env/bin/activate

Install Necessary Packages

We’ll use Playwright to interact with the Google Flights website, as it can automate actions on dynamic web pages effectively. Tenacity will help with retry mechanisms to enhance reliability.

# Install required packages (asyncio ships with Python, so it doesn't need to be installed)
pip install playwright tenacity
# Install Playwright's browser dependencies
playwright install chromium

Step 2: Define Data Structures

To keep our code organized, we’ll define data classes to store the search parameters and the flight data we extract.

from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchParameters:
    departure: str
    destination: str
    departure_date: str
    return_date: Optional[str] = None
    ticket_type: str = "One way"


@dataclass
class FlightData:
    airline: str
    departure_time: str
    arrival_time: str
    duration: str
    stops: str
    price: str
    co2_emissions: str
    emissions_variation: str
  • SearchParameters holds the details needed for a flight search.
  • FlightData stores information about each flight, including airline, timings, duration, stops, price, and CO2 emissions.

Step 3: Crafting the Flight Scraper Class

The core of our scraper will reside in a FlightScraper class, which we’ll build in stages.

Define CSS Selectors

Using CSS selectors allows us to target specific elements on the page to extract details such as the airline name, departure time, and price. Note that Google auto-generates these class names and changes them periodically, so verify each selector in your browser’s DevTools before running the scraper.

class FlightScraper:
    SELECTORS = {
        "airline": "div.sSHqwe.tPgKwe.ogfYpf",
        "departure_time": 'span[aria-label^="Departure time"]',
        "arrival_time": 'span[aria-label^="Arrival time"]',
        "duration": 'div[aria-label^="Total duration"]',
        "stops": "div.hF6lYb span.rGRiKd",
        "price": "div.FpEdX span",
        "co2_emissions": "div.O7CXue",
        "emissions_variation": "div.N6PNV",
    }

Simulate Filling Out the Search Form

We’ll use Playwright’s asynchronous functions to mimic a user filling in the search parameters on Google Flights.

async def _fill_search_form(self, page, params: SearchParameters) -> None:
    # Select the ticket type (e.g. "One way")
    ticket_type_div = page.locator("div.VfPpkd-TkwUic[jsname='oYxtQd']").first
    await ticket_type_div.click()
    await page.locator("li").filter(has_text=params.ticket_type).nth(0).click()
    # Fill departure and destination, confirming each autocomplete suggestion
    from_input = page.locator("input[aria-label='Where from?']")
    await from_input.fill(params.departure)
    await page.keyboard.press("Enter")
    to_input = page.locator("input[aria-label='Where to?']")
    await to_input.fill(params.destination)
    await page.keyboard.press("Enter")
    # Fill the departure date and confirm to trigger the search
    date_input = page.locator("input[aria-label='Departure date']")
    await date_input.fill(params.departure_date)
    await page.keyboard.press("Enter")

Load All Available Flights

Google Flights often requires users to click “Show more flights” to reveal all options. Automate this with a loop that clicks this button until no more flights are displayed.

async def _load_all_flights(self, page) -> None:
    while True:
        try:
            more_button = await page.wait_for_selector(
                'button[aria-label*="more flights"]', timeout=5000
            )
            if more_button:
                await more_button.click()
                await page.wait_for_timeout(2000)
            else:
                break
        except Exception:
            # The button is gone or timed out: all flights are loaded
            break

Extract Flight Data

Once all flights are visible, loop through each flight’s elements and retrieve details using the CSS selectors.

async def _extract_text(self, element) -> str:
    # Return the element's text, or "N/A" if the selector matched nothing
    if element:
        return (await element.inner_text()).strip()
    return "N/A"

async def _extract_flight_data(self, page) -> list[FlightData]:
    await page.wait_for_selector("li.pIav2d", timeout=30000)
    flights = await page.query_selector_all("li.pIav2d")
    flights_data = []
    for flight in flights:
        flight_info = {}
        for key, selector in self.SELECTORS.items():
            element = await flight.query_selector(selector)
            flight_info[key] = await self._extract_text(element)
        flights_data.append(FlightData(**flight_info))
    return flights_data

Step 4: Implement Retry Logic for Reliability

To make our scraper more resilient, we’ll implement a retry mechanism using the tenacity library, which retries the function if it encounters certain errors.

from playwright.async_api import async_playwright
from tenacity import retry, stop_after_attempt, wait_fixed

@retry(stop=stop_after_attempt(3), wait=wait_fixed(5))
async def search_flights(self, params: SearchParameters) -> list[FlightData]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://www.google.com/flights")
        await self._fill_search_form(page, params)
        await self._load_all_flights(page)
        flights = await self._extract_flight_data(page)
        await browser.close()
        return flights

Step 5: Save Results to a JSON File

Once you’ve scraped the data, store it in a JSON file for easy retrieval and analysis.

import json
from datetime import datetime

def save_results(self, flights: list[FlightData], params: SearchParameters) -> str:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"flight_results_{params.departure}_{params.destination}_{timestamp}.json"
    output_data = {
        "search_parameters": vars(params),
        "flights": [vars(flight) for flight in flights],
    }
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(output_data, f, indent=2, ensure_ascii=False)
    return filename
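
Once saved, a results file can be loaded back for analysis. Here is a minimal sketch; the filename in the comment is an example, since real filenames include the route and a timestamp:

```python
import json

def load_results(filename: str) -> dict:
    """Load a saved results file back into a dictionary."""
    with open(filename, encoding="utf-8") as f:
        return json.load(f)

# Example:
# data = load_results("flight_results_LAX_JFK_20241201_120000.json")
# data["search_parameters"]  -> the search inputs
# data["flights"]            -> list of flight dictionaries
```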

Step 6: Running the Scraper

Here’s how to initiate the scraper with sample search parameters.

import asyncio

async def main():
    scraper = FlightScraper()
    params = SearchParameters(
        departure="LAX",
        destination="JFK",
        departure_date="2024-12-01",
        ticket_type="One way",
    )
    try:
        flights = await scraper.search_flights(params)
        scraper.save_results(flights, params)
        print("Flights scraped successfully.")
    except Exception as e:
        print(f"Error during flight search: {e}")

if __name__ == "__main__":
    asyncio.run(main())

Overcoming Common Scraping Challenges: IP Blocking and CAPTCHA

Scraping Google Flights comes with unique challenges. Here’s how to tackle them:

  • IP Blocking: Use rotating proxies to spread requests across many IP addresses and avoid triggering Google’s anti-scraping measures.
  • CAPTCHAs: Implement CAPTCHA-solving solutions such as Bright Data’s Web Unlocker, which can automatically bypass CAPTCHA challenges.
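
As a sketch of the proxy approach, Playwright accepts proxy credentials at browser launch through its `proxy` option. The endpoint and credentials below are placeholders for whatever rotating-proxy provider you use:

```python
# Proxy settings in the shape Playwright's launch(proxy=...) option expects.
# The server, username, and password are placeholders, not real credentials.
PROXY_CONFIG = {
    "server": "http://proxy.example.com:8000",
    "username": "your-username",
    "password": "your-password",
}

# Used at launch time inside search_flights, e.g.:
# browser = await p.chromium.launch(headless=True, proxy=PROXY_CONFIG)
```

With a rotating endpoint, each new browser session can exit through a different IP without any other code changes.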

Conclusion

Scraping Google Flights offers deep insights into travel patterns, prices, and airline options. By setting up a scraper in Python with Playwright and following best practices for reliability, you can extract valuable flight data for personal or business use. To improve efficiency, leverage proxy and CAPTCHA-handling tools, especially when scaling up. With these foundational steps in place, you can extend the scraper to cover more routes, dates, and data points.
