How to Build an AI Scraper With Crawl4AI and DeepSeek

In this guide, I’ll show you how to build an AI-powered scraper using Crawl4AI and DeepSeek. Crawl4AI is a flexible, open-source scraping tool built to work with AI models. DeepSeek is a powerful AI model that extracts structured data from unstructured web pages. Combining the two gives you a scraper that can navigate websites intelligently and extract clean, organized data without breaking a sweat!

What is Crawl4AI?

Crawl4AI is an open-source, AI-ready web scraper designed to work with large language models (LLMs). Unlike traditional scrapers, it does not rely on fixed HTML parsing rules. Instead, it can extract structured data using AI models like DeepSeek.

Features of Crawl4AI:

  • Designed for LLMs: It produces structured data optimized for AI training and retrieval-augmented generation (RAG).
  • Smart Browser Control: It manages browser sessions, proxies, and custom hooks.
  • AI-powered Parsing: Combines LLM-based extraction with heuristic algorithms to pull structured information out of raw pages.
  • Open Source: No API keys required; deployable on Docker and cloud platforms.

Why Use DeepSeek With Crawl4AI?

DeepSeek is an advanced, open-source AI model that processes text efficiently. When combined with Crawl4AI, it allows dynamic content extraction without hardcoded parsing rules. This is particularly useful for:

  • Sites with frequently changing structures: AI adapts to new layouts automatically.
  • Extracting unstructured content: AI models can analyze free text, blog posts, or customer reviews.
  • Handling different page formats: Many websites use multiple templates for their content, which traditional scrapers struggle to manage.

Web Scraping With Crawl4AI and DeepSeek: Step-By-Step Guide

Step 1: Set Up the Project

First, create a new project directory and set up a virtual environment.

mkdir ai-scraper
cd ai-scraper
python -m venv venv

Activate the virtual environment:

For macOS/Linux:

source venv/bin/activate

For Windows:

venv\Scripts\activate

Step 2: Install Crawl4AI

Install Crawl4AI and its dependencies using pip:

pip install crawl4ai

Run the setup command to install browser dependencies:

crawl4ai-setup

This command installs Playwright browsers and sets up a database for caching.
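
If crawl4ai-setup fails on a fresh system, you can usually install the browser binaries directly through Playwright, which Crawl4AI uses under the hood:

python -m playwright install --with-deps chromium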

Step 3: Create a Scraper File

Create a new file called scraper.py inside the project folder.

touch scraper.py

Open scraper.py in a text editor and add the basic async structure:

import asyncio

async def main():
    # Scraper logic will go here
    pass

if __name__ == "__main__":
    asyncio.run(main())

Step 4: Configure the Scraper

Inside scraper.py, import the necessary Crawl4AI components:

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

Define a basic web scraping function:

async def main():
    # Browser settings
    browser_config = BrowserConfig(headless=True)

    # Crawler settings
    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    # Initialize and run the AI-powered scraper
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://www.example.com", config=crawler_config)

        # Display the extracted data
        print(f"Extracted Data:\n{result.markdown[:1000]}")

Step 5: Handle Website Restrictions

Some websites block scrapers by detecting bot-like behavior. If your request comes back with a 403 Forbidden error, the site has most likely blocked your scraper.

To bypass restrictions, we can use Bright Data’s Web Unlocker API. This service handles proxy rotation and CAPTCHA solving automatically.
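
Before reaching for a proxy, it helps to detect blocks programmatically. A minimal check inside main(), right after the arun() call (this assumes the success, status_code, and error_message fields on the crawl result, which recent Crawl4AI releases expose):

if not result.success:
    # A 403 here usually means the site is blocking automated traffic
    print(f"Crawl failed (status {result.status_code}): {result.error_message}")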

Set Up Web Unlocker API

  1. Create a Bright Data account.
  2. Navigate to Proxies & Scraping in the dashboard.
  3. Activate Web Unlocker API.
  4. Copy the proxy credentials and store them in a .env file:
PROXY_SERVER=https://proxy.brightdata.com:22225
PROXY_USERNAME=your_username
PROXY_PASSWORD=your_password

Integrate Proxy Into the Scraper
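
The snippet below reads the credentials with python-dotenv, so install it first if it isn’t already in your environment:

pip install python-dotenv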

Modify scraper.py to use the proxy settings:

import os
from dotenv import load_dotenv

load_dotenv()

# Proxy configuration
proxy_config = {
    "server": os.getenv("PROXY_SERVER"),
    "username": os.getenv("PROXY_USERNAME"),
    "password": os.getenv("PROXY_PASSWORD")
}

browser_config = BrowserConfig(headless=True, proxy_config=proxy_config)

Now, the scraper will route traffic through Bright Data’s network to avoid detection.
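
To confirm traffic is really routed through the proxy, a quick check (my own addition, not part of the original flow) is to temporarily point the crawler at a page that echoes the caller’s IP, such as httpbin.org/ip:

result = await crawler.arun(url="https://httpbin.org/ip", config=crawler_config)
print(result.markdown)  # should show an IP from Bright Data's network, not your own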

Handling Site Restrictions at Scale

As your scraper grows more advanced, you may run into common challenges like IP blocks, CAPTCHAs, or JavaScript-heavy pages. To keep your scraper running smoothly, consider using tools that offer proxy rotation, browser emulation, and automated bypassing of anti-bot systems.

For example, integrating a proxy-based solution with built-in CAPTCHA handling can help maintain access to even the most protected websites. This ensures your AI-powered scraper remains reliable and scalable across a wide range of targets. My agency mostly uses Bright Data’s products for web scraping.

Step 6: Use DeepSeek for AI Data Extraction

To extract meaningful information, we need DeepSeek, an AI model that can understand raw page content and turn it into structured data. We’ll call it through Groq, which hosts a distilled DeepSeek model (deepseek-r1-distill-llama-70b) and serves it over a fast API.

Get a Groq API Key

  1. Sign up on GroqCloud.
  2. Create an API key under API Keys.
  3. Store it in your .env file:
LLM_API_TOKEN=your_groq_api_key
LLM_MODEL=groq/deepseek-r1-distill-llama-70b
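
With the .env file updated, a quick sanity check (my own addition, not part of the original flow) confirms the values load before wiring them into the extraction strategy:

import os
from dotenv import load_dotenv

load_dotenv()
print("Model:", os.getenv("LLM_MODEL"))
print("API token loaded:", bool(os.getenv("LLM_API_TOKEN")))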

Define the Extraction Schema

Create a models/ directory and a file models/data_schema.py:

from pydantic import BaseModel

class ExtractedData(BaseModel):
    title: str
    description: str
    image_url: str
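
The schema is plain Pydantic, so you can sanity-check it on its own before involving the crawler. A tiny example with made-up values:

from models.data_schema import ExtractedData

sample = ExtractedData(
    title="Example Domain",
    description="A short description of the page.",
    image_url="https://www.example.com/logo.png"
)
print(sample.model_dump())  # {'title': ..., 'description': ..., 'image_url': ...}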

Use DeepSeek for AI Parsing

Modify scraper.py to include AI-powered extraction:

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from models.data_schema import ExtractedData

extraction_strategy = LLMExtractionStrategy(
    provider=os.getenv("LLM_MODEL"),
    api_token=os.getenv("LLM_API_TOKEN"),
    schema=ExtractedData.model_json_schema(),
    extraction_type="schema",
    instruction="Extract the title, description, and image URL from the content.",
    input_format="markdown"
)

crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    extraction_strategy=extraction_strategy
)

Now, the AI model will analyze the page content and structure the data automatically.

Step 7: Save Extracted Data

Modify scraper.py to save the extracted data as a JSON file:

import json

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://www.example.com", config=crawler_config)

        # Parse AI-extracted data
        extracted_data = json.loads(result.extracted_content)

        # Save to JSON
        with open("output.json", "w", encoding="utf-8") as f:
            json.dump(extracted_data, f, indent=4)

if __name__ == "__main__":
    asyncio.run(main())
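
Putting the pieces together, scraper.py ends up looking roughly like this (a consolidated sketch of the snippets above; the target URL is a placeholder, and the proxy block is optional if the site doesn’t block you):

import asyncio
import json
import os

from dotenv import load_dotenv
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from models.data_schema import ExtractedData

load_dotenv()

# Proxy settings from the .env file (Bright Data Web Unlocker)
proxy_config = {
    "server": os.getenv("PROXY_SERVER"),
    "username": os.getenv("PROXY_USERNAME"),
    "password": os.getenv("PROXY_PASSWORD")
}
browser_config = BrowserConfig(headless=True, proxy_config=proxy_config)

# DeepSeek (served by Groq) turns the page content into the ExtractedData schema
extraction_strategy = LLMExtractionStrategy(
    provider=os.getenv("LLM_MODEL"),
    api_token=os.getenv("LLM_API_TOKEN"),
    schema=ExtractedData.model_json_schema(),
    extraction_type="schema",
    instruction="Extract the title, description, and image URL from the content.",
    input_format="markdown"
)
crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    extraction_strategy=extraction_strategy
)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://www.example.com", config=crawler_config)

        # Parse the AI-extracted JSON and save it to disk
        extracted_data = json.loads(result.extracted_content)
        with open("output.json", "w", encoding="utf-8") as f:
            json.dump(extracted_data, f, indent=4)

if __name__ == "__main__":
    asyncio.run(main())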

Final Steps: Running the Scraper

Run the scraper with:

python scraper.py

This will:

  1. Extract data from the website.
  2. Use AI to structure the content.
  3. Save the data to a JSON file (sample output below).
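
The resulting output.json mirrors the ExtractedData schema. The exact content depends on the page and on what DeepSeek returns, but for a simple page it will look roughly like this (placeholder values):

[
    {
        "title": "Example Domain",
        "description": "A short summary of the page content.",
        "image_url": ""
    }
]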

So, we just built an AI-powered web scraper using Crawl4AI and DeepSeek. Unlike regular scrapers, this one is smarter and more flexible. It can adapt to website changes, bypass anti-bot protections, and extract data without complex parsing rules.

Conclusion

With this setup, you can easily scrape even the most protected websites. No more broken scrapers every time a site updates! Now, you have a powerful tool that makes web scraping faster, smarter, and more reliable.

Go ahead and try it out on different websites. Happy coding and happy scraping!
