Web Scraping with Jupyter Notebooks

In this guide, I’m going to show you how to use 🌠Jupyter Notebooks🌠 for web scraping. We’ll walk through everything step by step — from writing your scraping code to understanding the data and creating valuable visualizations. Trust me, it’ll make your scraping tasks much smoother and more efficient! Let’s dive in.

What are Jupyter Notebooks?

Jupyter Notebooks are interactive documents that combine live code, visualizations, and narrative text in a single, shareable document. They are widely used in data science, machine learning, and research fields for their real-time flexibility and ability to document work. In Jupyter Notebooks, you can write Python code in small cells, run them independently, and observe the output immediately.

These notebooks support various languages, but Python is the most commonly used language due to its simplicity and the many libraries available. Jupyter allows you to interact with your data, create plots, and debug code within one notebook. It’s especially useful for data exploration, making it an ideal tool for web scraping.
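
For example, a cell might contain just a couple of lines; run it (Shift+Enter) and the output appears directly below the cell:

# A single notebook cell: run it on its own and the output shows up right below it
message = "Hello from a Jupyter cell!"
print(message)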

No-Code Alternatives

You can check out my article that lists the top no-code scrapers, or skim this TL;DR:

  1. Bright Data — Enterprise-grade tool for high-volume data extraction.
  2. Octoparse — Flexible multi-tool with free and premium plans.
  3. ParseHub — Beginner-friendly, good free plan, some bugs.
  4. Apify — Pre-made templates, great for niche use cases.
  5. Web Scraper — Localized scraping via browser extension, user-friendly.

I am not affiliated with any of them. Now, let’s go back to Jupyter 🌟!

Why Use Jupyter Notebooks for Web Scraping?

Jupyter Notebooks are especially well-suited for web scraping for several reasons:

Interactive Development

Jupyter’s interactive nature allows you to write and execute code in small chunks called cells. This means you can test individual parts of your scraping code, inspect the results, and adjust as needed. This iterative approach helps you quickly identify and fix issues.
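
For example, you can fetch a page once in one cell and then refine your parsing logic in a separate cell, re-running only that second cell as you iterate (the URL here is just a placeholder):

# Cell 1: fetch the page once
import requests
response = requests.get('https://example.com')

# Cell 2: tweak and re-run the parsing on its own, without re-downloading the page
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)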

Documentation and Explanation

With the Markdown feature in Jupyter Notebooks, you can document each step of the process in plain text, explain your code logic, and add annotations. This makes your work easier to understand for others (or for yourself when you revisit it later). It’s also an excellent way to create tutorials and share knowledge.

Data Analysis and Visualization

Once you’ve scraped the data, Jupyter Notebooks allow you to process, analyze, and visualize it within the same environment. You can manipulate your data and create insightful visualizations using libraries such as pandas, matplotlib, and seaborn.

Reproducibility and Sharing

Jupyter Notebooks can be easily shared as .ipynb files, allowing others to view and run the code on their own systems. You can also export notebooks to other formats, such as HTML or PDF, to present your results in a more polished form.
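
For example, exporting to HTML or PDF is a single command with nbconvert, which ships with Jupyter (replace my_notebook.ipynb with your notebook’s filename; the PDF export also requires a LaTeX installation):

jupyter nbconvert --to html my_notebook.ipynb
jupyter nbconvert --to pdf my_notebook.ipynb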

How to Use Jupyter Notebooks for Web Scraping?

Before you can start scraping, you need to set up your environment. Below is a simple guide to get you started.

Step 1: Install Python and Jupyter

Make sure you have Python 3.6 or higher installed on your machine. If not, you can download it from the official Python website.

Once you install Python, you can install Jupyter Notebooks using pip, Python’s package manager.

pip install jupyter

Step 2: Create a Virtual Environment

It’s a good practice to create a virtual environment for your project to keep dependencies organized. You can create a new virtual environment with the following command:

python -m venv scraper

Then, activate the environment:

  • Windows: scraper\Scripts\activate
  • macOS/Linux: source scraper/bin/activate
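
If you installed Jupyter before creating the environment, you may also want to install it again inside the activated environment, so the notebook and its kernel use the environment’s packages:

pip install jupyter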

Step 3: Install Required Libraries

Next, install the necessary libraries for web scraping and data analysis: requests, BeautifulSoup, pandas, and seaborn for scraping and visualizing the data.

pip install requests beautifulsoup4 pandas seaborn

Once the libraries are installed, you can launch Jupyter Notebook with:

jupyter notebook

This command will open the Jupyter dashboard in your browser, where you can create a new notebook and start writing your web scraping code.

Step-by-Step Web Scraping with Jupyter Notebooks

Now that everything is set up, let’s dive into the web scraping process using Jupyter Notebooks.

Step 1: Define the Target Website

For this tutorial, let’s scrape data from a website called Worldometer. This website provides detailed statistics on various global topics, including CO2 emissions.

The page we want to scrape contains a table about CO2 emissions in the United States.

Step 2: Sending HTTP Requests to Fetch Data

To scrape data, you first need to send an HTTP request to the website’s server. We’ll use the requests library for this. Here’s how you can fetch the page content:

import requests
# URL of the target website
url = 'https://www.worldometers.info/co2-emissions/us-co2-emissions/'
# Send a GET request to the website
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
    print('Successfully fetched the webpage!')
else:
    print('Failed to retrieve the page')
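
Some websites respond differently to requests that don’t look like they come from a browser. If the request fails for you, a common tweak (not required for every site) is to send a browser-like User-Agent header and let requests raise an error on bad status codes:

# Optional: retry with a browser-like User-Agent if the plain request is blocked
headers = {'User-Agent': 'Mozilla/5.0 (compatible; my-scraper/1.0)'}
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses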

Step 3: Parse the HTML Content

Once we have the page content, we need to extract the data we are interested in. We will use BeautifulSoup to parse the HTML and locate the data table.

from bs4 import BeautifulSoup
# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find the table containing the CO2 emissions data
table = soup.find('table')
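
Note that soup.find('table') returns the first table on the page. If the page happens to contain several tables, you can list them and inspect their headers to pick the right one (the selection step here is just an illustration):

# If the page has more than one table, inspect them and pick the one you need
tables = soup.find_all('table')
print(f"Found {len(tables)} table(s) on the page")
for i, t in enumerate(tables):
    print(i, [th.text.strip() for th in t.find_all('th')][:5])
# table = tables[0]  # then select the right one by its index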

Step 4: Extract Table Data

Next, we need to extract the headers and rows from the table. We can loop through each table row and collect the data into a list.

# Extract the table headers
headers = [header.text.strip() for header in table.find_all('th')]
# Extract the rows of data
rows = []
for row in table.find_all('tr')[1:]:  # Skip the header row
    cells = row.find_all('td')
    row_data = [cell.text.strip() for cell in cells]
    rows.append(row_data)
# Print the headers and first row to check the data
print(headers)
print(rows[0])
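
As a side note, pandas can often do this entire step in one call with read_html, which returns a DataFrame for every table it finds in the HTML (this assumes a parser such as lxml is installed, e.g. pip install lxml):

import pandas as pd
from io import StringIO

# Alternative: let pandas parse every table on the page into DataFrames
tables = pd.read_html(StringIO(response.text))
print(f"pandas found {len(tables)} table(s)")
df_alt = tables[0]  # pick the table you need by index
df_alt.head()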

Step 5: Save the Data to a CSV File

Once you have the data in a structured format, you can save it to a CSV file for later analysis. We’ll use Python’s built-in csv module for this.

import csv
# Define the output CSV file
csv_file = 'co2_emissions.csv'
# Write the data to the CSV file
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(headers)  # Write headers
    writer.writerows(rows)    # Write rows
print(f"Data has been saved to {csv_file}")

Step 6: Analyze the Data

Now that the data has been saved in a CSV file, we can use pandas to load it into a DataFrame for easy analysis.

import pandas as pd
# Load the data into a pandas DataFrame
df = pd.read_csv(csv_file)
# Display the first few rows of the data
df.head()
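
Before plotting, it’s also worth a quick sanity check of the column names, types, and value ranges:

# Quick overview of columns, data types, and missing values
df.info()
# Summary statistics for the numeric columns
df.describe()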

Step 7: Visualize the Data

Finally, let’s visualize the data using seaborn and matplotlib. For example, we can create a line plot to show how CO2 emissions have changed over the years.

import seaborn as sns
import matplotlib.pyplot as plt
# Convert 'Fossil CO2 Emissions' column to numeric
df['Fossil CO2 Emissions (tons)'] = df['Fossil CO2 Emissions (tons)'].str.replace(',', '').astype(float)
# Ensure the 'Year' column is numeric
df['Year'] = pd.to_numeric(df['Year'], errors='coerce')
# Sort the data by year
df = df.sort_values(by='Year')
# Create a line plot of CO2 emissions over the years
plt.figure(figsize=(10, 6))
sns.lineplot(data=df, x='Year', y='Fossil CO2 Emissions (tons)', marker='o')
plt.title('CO2 Emissions in the U.S. Over the Years', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Fossil CO2 Emissions (tons)', fontsize=12)
plt.grid(True)
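# Optionally save the chart to an image file before calling plt.show()
# (the filename here is just an example)
plt.savefig('co2_emissions_us.png', dpi=150, bbox_inches='tight')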
plt.show()

Use Cases for Jupyter Notebooks in Web Scraping

Jupyter Notebooks are ideal for many web scraping scenarios, especially when combining scraping, analysis, and visualization in one place. Here are some use cases:

Educational Purposes

Jupyter Notebooks are great for creating interactive tutorials. You can guide beginners through web scraping, explaining the code and showing the results in real time.

Data Exploration and Analysis

Jupyter provides an excellent environment for data scientists or researchers to explore scraped data. You can quickly iterate on your code, clean the data, and visualize trends or patterns.

Prototyping and Testing

If you’re developing a web scraping tool or script, Jupyter allows you to test small pieces of code quickly. This iterative process can save time during development.

Conclusion

Web scraping with Jupyter Notebooks is a powerful approach that combines data collection, analysis, and visualization in a single environment. The interactive nature of Jupyter Notebooks lets you test, debug, and document your code easily, making them an excellent tool for web scraping tasks.

However, for large-scale scraping or automation, you may want to look into other solutions. Nonetheless, for many tasks, Jupyter Notebooks provide a convenient, flexible, and efficient platform for scraping and analyzing web data.

Got any questions? Let me know in the comments!
