How to Use Geziyor for Web Scraping?
In this guide, I’ll walk you through setting up Geziyor, using it to scrape data, and even exporting that data to a CSV file. By the end, you’ll be ready to dive into web scraping with Geziyor!
What is Geziyor?
Geziyor is a Golang-based framework designed specifically for web scraping and crawling. It’s made for developers who want to extract data from websites without manually handling HTTP requests, HTML parsing, or concurrency issues. Geziyor uses Go’s goroutines to handle concurrent requests, allowing you to scrape multiple web pages at once efficiently. It also includes built-in features for managing cookies, caching content, and exporting data to various formats.
Why Use Geziyor?
Geziyor offers several benefits that make it an attractive choice for web scraping:
- Concurrent Scraping: Geziyor simplifies the process of scraping multiple pages simultaneously using Go’s goroutines (see the sketch after this list).
- Easy HTML Parsing: Geziyor uses a built-in HTML parser that supports CSS selectors, allowing you to easily extract data from HTML elements.
- Customizable: You can modify the framework to suit your needs by adding custom logic for specific scraping tasks.
- Data Export: Geziyor supports automatic data export to various formats, including CSV and JSON.
- Cache and Session Management: Geziyor offers content caching and session management to handle repeated scraping more efficiently.
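To make the concurrency point concrete, here is a minimal sketch of tuning throughput through the Options struct. The ConcurrentRequests and RequestDelay field names match Geziyor’s documented options, but verify them against the version you install; the URL and values are placeholders:

package main

import (
    "fmt"
    "time"

    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
)

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs: []string{"https://www.example.com"},
        ParseFunc: parse,
        // Assumed option names from Geziyor's README; check your installed version.
        ConcurrentRequests: 5,           // cap the number of in-flight requests
        RequestDelay:       time.Second, // pause between consecutive requests
    }).Start()
}

func parse(g *geziyor.Geziyor, r *client.Response) {
    fmt.Println("Fetched:", r.Request.URL.String())
}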
Now, let’s dive into how to use Geziyor to scrape data from websites.
Setting Up Geziyor for Web Scraping
Before you can start scraping websites with Geziyor, you need to set up your environment. Geziyor requires Go 1.22 or later, so make sure your Go installation is at least that version. You can download the latest release from the official Go website.
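Not sure which version you have? Check it from your terminal:

go version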
Step 1: Create a Go Project
Open your terminal and create a new directory for your scraping project. Navigate to that directory and initialize a Go project.
mkdir my_scraper
cd my_scraper
go mod init scraper
Step 2: Install Geziyor
To use Geziyor in your project, you need to install it using the go get command:
go get -u github.com/geziyor/geziyor
This command will download and install the Geziyor framework into your project.
Step 3: Create the Scraper Code
Once Geziyor is installed, create a new file called scraper.go in your project directory. This file will contain the code for scraping the target website.
Step 4: Make Your First Request to Get HTML
To begin, let’s create a simple scraper that makes an HTTP request to a website and prints out the raw HTML content of the page. In your scraper.go file, add the following code:
package main

import (
    "fmt"

    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
)

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs:         []string{"https://www.example.com"},
        ParseFunc:         scraper,
        RobotsTxtDisabled: true,
    }).Start()
}

// scraper receives each response and prints the raw HTML body.
func scraper(g *geziyor.Geziyor, r *client.Response) {
    fmt.Println("HTML Content:", string(r.Body))
}
In this code:
- StartURLs specifies the URL you want to scrape.
- ParseFunc is the function that processes the HTML content returned by the request.
- RobotsTxtDisabled: true tells Geziyor to skip robots.txt checks; robots.txt is the file sites use to tell crawlers which pages they should not visit.
Run this code by executing the following command in your terminal:
go run scraper.go
This will fetch the page’s HTML content and print it in the terminal.
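If dumping the whole document is too noisy, you can print a single element instead. Since r.HTMLDoc is a parsed goquery document, any CSS selector works; this variation of the parse function prints only the page title:

func scraper(g *geziyor.Geziyor, r *client.Response) {
    // r.HTMLDoc is parsed HTML, so standard CSS selectors apply.
    fmt.Println("Title:", r.HTMLDoc.Find("title").Text())
}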
Step 5: Extract Specific Data from the HTML
Now that you can fetch a page’s full HTML, let’s extract specific data from it. Suppose you want to pull product names and prices from an e-commerce site.
To target specific elements, you use CSS selectors. Modify the scraper function to extract each product’s name and price:
package main

import (
    "fmt"

    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
)

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs:         []string{"https://www.example.com/products"},
        ParseFunc:         scraper,
        RobotsTxtDisabled: true,
    }).Start()
}

// scraper iterates over every div.product on the page and prints its
// name and price.
func scraper(g *geziyor.Geziyor, r *client.Response) {
    r.HTMLDoc.Find("div.product").Each(func(i int, s *goquery.Selection) {
        name := s.Find("h2.product-name").Text()
        price := s.Find("span.product-price").Text()
        fmt.Println("Product:", name, "Price:", price)
    })
}
In this code:
- We import goquery for the *goquery.Selection type used in the Each callback; Geziyor has already parsed the response into r.HTMLDoc for us.
- Find with a CSS selector scopes each lookup, so the name and price are read from within their own div.product block.
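CSS selectors can grab attributes as well as text. For example, if each product card links to a detail page (the a.product-link selector below is a hypothetical example), goquery’s Attr returns the attribute’s value along with a boolean telling you whether it exists:

r.HTMLDoc.Find("div.product").Each(func(i int, s *goquery.Selection) {
    // Attr returns the value and whether the attribute was present.
    if href, ok := s.Find("a.product-link").Attr("href"); ok {
        // JoinURL resolves a relative link against the current page URL.
        fmt.Println("Detail page:", r.JoinURL(href))
    }
})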
Step 6: Export Data to a CSV File
Geziyor also makes it easy to export the scraped data to a CSV file. Let’s modify the scraper to save the extracted product data into a CSV file.
First, import the geziyor/export package, then add the Exporters option to the Options struct:
package main

import (
    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
    "github.com/geziyor/geziyor/export"
)

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs:         []string{"https://www.example.com/products"},
        ParseFunc:         scraper,
        RobotsTxtDisabled: true,
        Exporters: []export.Exporter{
            &export.CSV{FileName: "products.csv"},
        },
    }).Start()
}

// scraper sends each product to the Exports channel instead of printing it;
// the CSV exporter drains the channel and writes the rows to disk.
func scraper(g *geziyor.Geziyor, r *client.Response) {
    r.HTMLDoc.Find("div.product").Each(func(i int, s *goquery.Selection) {
        name := s.Find("h2.product-name").Text()
        price := s.Find("span.product-price").Text()
        g.Exports <- map[string]interface{}{
            "Name":  name,
            "Price": price,
        }
    })
}
Here:
- The CSV exporter writes the extracted product data to products.csv in the directory you run the scraper from.
- Each map sent to g.Exports becomes one row in the CSV file, with the product’s name and price as columns.
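Switching formats is a one-line change. Geziyor also ships a JSON exporter; the sketch below assumes it accepts a FileName field like the CSV exporter does, so double-check the export package in your installed version:

Exporters: []export.Exporter{
    // Assumed to mirror the CSV exporter's FileName field.
    &export.JSON{FileName: "products.json"},
},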
Step 7: Handle Pagination and Scrape Multiple Pages
Many websites have multiple pages of content that need to be scraped. Geziyor makes it easy to follow pagination links and scrape multiple pages.
Let’s modify the scraper function to follow the “Next” button on the page, which usually links to the next set of products. Here’s how you can implement pagination:
func scraper(g *geziyor.Geziyor, r *client.Response) {
    r.HTMLDoc.Find("div.product").Each(func(i int, s *goquery.Selection) {
        name := s.Find("h2.product-name").Text()
        price := s.Find("span.product-price").Text()
        g.Exports <- map[string]interface{}{
            "Name":  name,
            "Price": price,
        }
    })

    // Follow the "Next" button link, resolving it against the current URL.
    if href, ok := r.HTMLDoc.Find("a.next").Attr("href"); ok {
        g.Get(r.JoinURL(href), scraper)
    }
}
In this code, after scraping the current page, Geziyor checks if there is a “Next” button (a.next). If the button exists, it follows the link to the next page and continues scraping.
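If a site exposes numbered pages instead of a “Next” link, another option is to queue every page URL up front with StartRequestsFunc (covered in the next step). This sketch assumes a hypothetical ?page=N URL scheme and a known page count:

// startRequests queues a fixed range of pages; swap in the real URL
// pattern and page count for your target site.
func startRequests(g *geziyor.Geziyor) {
    for page := 1; page <= 5; page++ {
        g.Get(fmt.Sprintf("https://www.example.com/products?page=%d", page), g.Opt.ParseFunc)
    }
}

Wire it up by setting StartRequestsFunc: startRequests in the Options instead of StartURLs.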
Step 8: Handle JavaScript-Rendered Pages
Some websites use JavaScript to load content dynamically. Geziyor doesn’t support JavaScript rendering out of the box, but you can use the GetRendered function to wait for JavaScript to load the content.
Here’s an example of how to scrape a JavaScript-rendered page:
package main

import (
    "fmt"
    "time"

    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
)

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartRequestsFunc: requestFunc,
        ParseFunc:         scraper,
        RequestDelay:      10 * time.Second, // RequestDelay is a time.Duration
    }).Start()
}

// requestFunc issues the initial request through headless Chrome.
func requestFunc(g *geziyor.Geziyor) {
    g.GetRendered("https://www.example.com/javascript-page", g.Opt.ParseFunc)
}

func scraper(g *geziyor.Geziyor, r *client.Response) {
    r.HTMLDoc.Find("div.product-info").Each(func(i int, s *goquery.Selection) {
        fmt.Println("Product:", s.Find(".product-name").Text(), "Price:", s.Find(".product-price").Text())
    })
}
In this code:
- GetRendered loads the page in headless Chrome and waits for the JavaScript-generated content before handing the response to the parse function.
- RequestDelay is a time.Duration that spaces out consecutive requests, which is good practice when driving a full browser.
To further enhance your scraping reliability, consider integrating Bright Data’s residential proxies into your Geziyor workflow. These proxies can help bypass anti-bot measures and ensure high-speed, uninterrupted data extraction, even from the toughest websites. Interested in other providers? Go over my list of the best residential proxies.
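If you do add proxies, Geziyor can route requests through them via its options. The sketch below assumes the ProxyFunc option and the client.RoundRobinProxy helper from Geziyor’s documentation; the proxy URL is a placeholder, not a real endpoint:

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"https://www.example.com/products"},
    ParseFunc: scraper,
    // Assumed API: rotate each request across the listed proxy URLs.
    ProxyFunc: client.RoundRobinProxy("http://username:password@proxy.example.com:8080"),
}).Start()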
Conclusion
Geziyor is a powerful and easy-to-use framework for web scraping in Golang. With its concurrent scraping capabilities, simple API, and built-in HTML parsing, you can quickly set up and start scraping data from websites. You can also export the data in formats like CSV and JSON and handle complex tasks such as pagination and JavaScript rendering. If you’re working with Go and need a fast, scalable web scraping solution, Geziyor is definitely worth exploring.