Go Web Scraper: Build and Optimize HTML Parsers

Leapcell - Mar 3 - - Dev Community

Image description

Leapcell: The Next-Gen Serverless Platform for Web Hosting

Installation and Usage of Goquery

Installation

Execute:

go get github.com/PuerkitoBio/goquery
Enter fullscreen mode Exit fullscreen mode

Import

import "github.com/PuerkitoBio/goquery"
Enter fullscreen mode Exit fullscreen mode

Load the Page

Take the IMDb Popular Movies page as an example:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    res, err := http.Get("https://www.imdb.com/chart/moviemeter/")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()
    if res.StatusCode != 200 {
        log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
    }
Enter fullscreen mode Exit fullscreen mode

Get the Document Object

    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }
    // Other creation methods
    // doc, err := goquery.NewDocumentFromReader(reader io.Reader)
    // doc, err := goquery.NewDocument(url string)
    // doc, err := goquery.NewDocument(strings.NewReader("<p>Example content</p>"))
Enter fullscreen mode Exit fullscreen mode

Select Elements

Element Selector

Select based on basic HTML elements. For example, dom.Find("p") matches all p tags. It supports chained calls:

ele.Find("h2").Find("a")
Enter fullscreen mode Exit fullscreen mode

Attribute Selector

Filter elements by element attributes and values, with multiple matching methods:

Find("div[my]")        // Filter div elements with the my attribute
Find("div[my=zh]")     // Filter div elements whose my attribute is zh
Find("div[my!=zh]")    // Filter div elements whose my attribute is not equal to zh
Find("div[my|=zh]")    // Filter div elements whose my attribute is zh or starts with zh-
Find("div[my*=zh]")    // Filter div elements whose my attribute contains the string zh
Find("div[my~=zh]")    // Filter div elements whose my attribute contains the word zh
Find("div[my$=zh]")    // Filter div elements whose my attribute ends with zh
Find("div[my^=zh]")    // Filter div elements whose my attribute starts with zh
Enter fullscreen mode Exit fullscreen mode

parent > child Selector

Filter the child elements under a certain element. For example, dom.Find("div>p") filters the p tags under the div tag.

element + next Adjacent Selector

Use it when the elements are irregularly selected, but the previous element has a pattern. For example, dom.Find("p[my=a]+p") filters the adjacent p tags whose my attribute value of the p tag is a.

element~next Sibling Selector

Filter the non-adjacent tags under the same parent element. For example, dom.Find("p[my=a]~p") filters the sibling p tags whose my attribute value of the p tag is a.

ID Selector

It starts with # and precisely matches the element. For example, dom.Find("#title") matches the content with id=title, and you can specify the tag dom.Find("p#title").

ele.Find("#title")
Enter fullscreen mode Exit fullscreen mode

Class Selector

It starts with . and filters the elements with the specified class name. For example, dom.Find(".content1"), and you can specify the tag dom.Find("div.content1").

ele.Find(".title")
Enter fullscreen mode Exit fullscreen mode

Selector OR (|) Operation

Combine multiple selectors, separated by commas. Filtering is done if any one of them is satisfied. For example, Find("div,span").

func main() {
    html := `<body>
                <div lang="zh">DIV1</div>
                <span>
                    <div>DIV5</div>
                </span>
            </body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    dom.Find("div,span").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Html())
    })
}
Enter fullscreen mode Exit fullscreen mode

Filters

:contains Filter

Filter elements that contain the specified text. For example, dom.Find("p:contains(a)") filters the p tags that contain a.

dom.Find("div:contains(DIV2)").Each(func(i int, selection *goquery.Selection) {
    fmt.Println(selection.Text())
})
Enter fullscreen mode Exit fullscreen mode

:has(selector)

Filter elements that contain the specified element nodes.

:empty

Filter elements that have no child elements.

:first-child and :first-of-type Filters

Find("p:first-child") filters the first p tag; first-of-type requires it to be the first element of that type.

:last-child and :last-of-type Filters

The opposite of :first-child and :first-of-type.

:nth-child(n) and :nth-of-type(n) Filters

:nth-child(n) filters the nth element of the parent element; :nth-of-type(n) filters the nth element of the same type.

:nth-last-child(n) and :nth-last-of-type(n) Filters

Calculate in reverse order, with the last element being the first one.

:only-child and :only-of-type Filters

Find(":only-child") filters the only child element in the parent element; Find(":only-of-type") filters the only element of the same type.

Get Content

ele.Html()
ele.Text()
Enter fullscreen mode Exit fullscreen mode

Traversal

Use the Each method to traverse the selected elements:

ele.Find(".item").Each(func(index int, elA *goquery.Selection) {
    href, _ := elA.Attr("href")
    fmt.Println(href)
})
Enter fullscreen mode Exit fullscreen mode

Built-in Functions

Array Positioning Functions

Eq(index int) *Selection
First() *Selection
Get(index int) *html.Node
Index...() int
Last() *Selection
Slice(start, end int) *Selection
Enter fullscreen mode Exit fullscreen mode

Extended Functions

Add...()
AndSelf()
Union()
Enter fullscreen mode Exit fullscreen mode

Filtering Functions

End()
Filter...()
Has...()
Intersection()
Not...()
Enter fullscreen mode Exit fullscreen mode

Loop Traversal Functions

Each(f func(int, *Selection)) *Selection
EachWithBreak(f func(int, *Selection) bool) *Selection
Map(f func(int, *Selection) string) (result []string)
Enter fullscreen mode Exit fullscreen mode

Document Modification Functions

After...()
Append...()
Before...()
Clone()
Empty()
Prepend...()
Remove...()
ReplaceWith...()
Unwrap()
Wrap...()
WrapAll...()
WrapInner...()
Enter fullscreen mode Exit fullscreen mode

Attribute Manipulation Functions

Attr*(), RemoveAttr(), SetAttr()
AttrOr(e string, d string)
AddClass(), HasClass(), RemoveClass(), ToggleClass()
Html()
Length()
Size()
Text()
Enter fullscreen mode Exit fullscreen mode

Node Search Functions

Contains()
Is...()
Enter fullscreen mode Exit fullscreen mode

Document Tree Traversal Functions

Children...()
Contents()
Find...()
Next...() *Selection
NextAll() *Selection
Parent[s]...()
Prev...() *Selection
Siblings...()
Enter fullscreen mode Exit fullscreen mode

Type Definitions

Document
Selection
Matcher
Enter fullscreen mode Exit fullscreen mode

Helper Functions

NodeName
OuterHtml
Enter fullscreen mode Exit fullscreen mode

Examples

Getting Started Example

func main() {
    html := `<html>
            <body>
                <h1 id="title">O Captain! My Captain!</h1>
                <p class="content1">
                O Captain! my Captain! our fearful trip is done,
                The ship has weather’d every rack, the prize we sought is won,
                The port is near, the bells I hear, the people all exulting,
                While follow eyes the steady keel, the vessel grim and daring;
                </p>
            </body>
            </html>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    dom.Find("p").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
}
Enter fullscreen mode Exit fullscreen mode

Example of Crawling IMDb Popular Movie Information

package main

import (
    "fmt"
    "log"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    doc, err := goquery.NewDocument("https://www.imdb.com/chart/moviemeter/")
    if err != nil {
        log.Fatal(err)
    }
    doc.Find(".titleColumn a").Each(func(i int, selection *goquery.Selection) {
        title := selection.Text()
        href, _ := selection.Attr("href")
        fmt.Printf("Movie Name: %s, Link: https://www.imdb.com%s\n", title, href)
    })
}
Enter fullscreen mode Exit fullscreen mode

The above examples extract the movie names and link information from the IMDb popular movies page. In actual use, you can adjust the selectors and processing logic according to your needs.

Leapcell: The Next-Gen Serverless Platform for Web Hosting

Finally, I would like to recommend the best platform for deploying Go services: Leapcell

Image description

1. Multi-Language Support

  • Develop with JavaScript, Python, Go, or Rust.

2. Deploy unlimited projects for free

  • pay only for usage — no requests, no charges.

3. Unbeatable Cost Efficiency

  • Pay-as-you-go with no idle charges.
  • Example: $25 supports 6.94M requests at a 60ms average response time.

4. Streamlined Developer Experience

  • Intuitive UI for effortless setup.
  • Fully automated CI/CD pipelines and GitOps integration.
  • Real-time metrics and logging for actionable insights.

5. Effortless Scalability and High Performance

  • Auto-scaling to handle high concurrency with ease.
  • Zero operational overhead — just focus on building.

Image description

Explore more in the documentation!

Leapcell Twitter: https://x.com/LeapcellHQ

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .