Web scraping with GoLang and Colly

Welcome to the first episode of the GoLang web scraping tutorial series. In today’s article, we will scrape a website with Go and extract data from it with Colly. Let’s start web scraping with Go.

We will be using a Go library called Colly for today’s task.

Go web scraping: Scrape a website

Before going directly into scraping a site, I want to share some Go web scraping FAQs with you.

Is Go good for web scraping?

Go is one of the fastest high-level languages around. It is compiled and statically typed, which makes it well suited for writing efficient, fast and scalable web scrapers/crawlers. Also, Go’s standard library handles most communication with web servers, so it rarely needs third-party packages for that. This is why you might choose Go for web scraping over other programming languages.

What is GO Colly?

Go Colly is an awesome web scraping library that can be used to scrape, extract and crawl almost any kind of website. With Colly, structured data can easily be extracted from websites and used for a wide range of applications, like data mining, data processing or archiving.

Requirements: We will be scraping this image-gallery site. Image src and image alt should be scraped. (Get Source Code)

Web scraping in Golang with Colly
Web scraping in Golang with Colly – Process

Structuring data format

We can use a struct type to define a data format. This format will later be used to collect and store the scraped data.

type Images struct {
	Image       string `json:"image"`
	Description string `json:"description"`
}

Also, inside the main() function, we will create a slice of Images to collect the scraped data.

func main() {
	allImages := make([]Images, 0)
}

Installing colly

Here is how you can install Colly with go get.

go get github.com/gocolly/colly/...

After installing colly, do not forget to import it.

import (
	"github.com/gocolly/colly"
)

Creating a New Colly Collector

Creating a NewCollector is essential. The collector sends HTTP requests and receives the HTML responses.

collector := colly.NewCollector(
	colly.AllowedDomains("img.webmatrices.com"),
)

Also, AllowedDomains restricts the collector to the listed domains: if a URL on any other domain is given, the collector won’t visit it, and execution moves on to the next line.

Analyzing HTML format

We will be scraping the below chunk of HTML. (View source / View source in GitHub)

<div class="gallery">
  <a target="_blank" href="./site-source-code_files/Bishwas_Bhandari.jpg">
    <img
      src="./site-source-code_files/Bishwas_Bhandari.jpg"
      alt="Subscribe to Developer Bishwas Bhandari"
      width="600"
      height="400"
    />
  </a>
  <div class="desc">Subscribe to Developer Bishwas Bhandari</div>
</div>

Here, the div with class="gallery" is the element that needs to be scraped. The gallery divs contain all the information we are seeking.

So, the CSS-selector for class="gallery" is .gallery or div.gallery.

And the CSS-selector for the image is div.gallery > a > img.

Parsing HTML with Golang: Colly

It’s easy to parse any HTML chunk with Go and Colly. We will use the collector.OnHTML function for HTML parsing. It collects every HTML chunk matching the given CSS selector and loops over the matches, passing each matched HTML element to our callback, where we can extract our data.

collector.OnHTML(".gallery", func(h *colly.HTMLElement) {
    // Parsing, scraping and data extraction.
})

Here, h is the HTML element we were talking about in the above paragraph.

Finding elements

As we know, div.gallery > a > img selects the data we are seeking. We already got div.gallery with the OnHTML function. Now, a > img is what we need.

// Parsing, Scraping and data extraction.

image_element := h.DOM.Find("a > img").Eq(0)

Here, the h.DOM.Find function searches for a > img elements inside the h HTML element and returns the selection of matching elements.

And Eq(0) is used to get the first element of that selection.

Data extraction

We got the HTML element object holding the data we want; now we extract data from it. The image source and the description of the image are what we want.

image, _ := image_element.Attr("src")
description, _ := image_element.Attr("alt")

Now, we should assign the extracted data to the Images struct-type.

images := Images{
	Image:       image,
	Description: description,
}
// Appending images data to allImages 
allImages = append(allImages, images)

With that, we have successfully appended the data to allImages.

Sending HTTP requests with Go: Colly

collector.Visit sends the HTTP request to the server.

// Executes on sending request
collector.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting", r.URL.String())
})

// Sends HTTP requests to the server
collector.Visit("https://img.webmatrices.com")

Also, the callback passed to collector.OnRequest runs just before each request is sent.

After the response arrives, Colly executes the collector.OnHTML callback.

Saving the extracted data

We can use the below function to save the extracted data.

func writeJSON(data []Images) {
	file, err := json.MarshalIndent(data, "", "  ")
	if err != nil {
		log.Println("Unable to create the JSON file.")
		return
	}
	_ = ioutil.WriteFile("images-data.json", file, 0644)
	fmt.Println("Scraping and Writing successful. Go for Good!")
}

and add writeJSON(allImages) at the end of the main() function.

Or, you can simply print it.

enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", "\t")
enc.Encode(allImages)

Result

Finally, here is our pure, scraped and extracted data.

[
  {
    "image": "/media/Bishwas_Bhandari.jpg",
    "description": "Subscribe to Developer Bishwas Bhandari"
  },
  {
    "image": "/media/Existence_of_1d_object_and_its_motion_with1.gif",
    "description": "Existance of 2d object"
  },
  {
    "image": "/media/7495.jpg",
    "description": "Developer Illustration - Programming Image"
  },
  {
    "image": "/media/linkedin_hashtag_generator.png",
    "description": "linkedin hashtag generator"
  },
  {
    "image": "/media/LinkedIn_Hashtag_Generator_icon.png",
    "description": "LinkedIn Hashtag Generator Icon"
  }
  ...
]

Libraries Used

These are all the libraries used in the final program.

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"log"
	"os"

	"github.com/gocolly/colly"
)

Don’t forget to import all of these libraries in your final code.

In a nutshell

We’ve learned all the basic skills to scrape a website with Go, extract data from it and save it in a readable, more usable format. This tutorial also helped us boost our Go programming skills.

We suggest you bookmark our blog and keep checking our updates for Go programming and web scraping.

Quick trick: Press CTRL+D to bookmark our blog.

Resources

These are all the resources I can provide to you. Hope they’ll help.

GitHub Repo

Here is the source code you want.

Other Examples

Other resources
