Search code examples
htmlgorecursiontreenodes

Recursive Function involving *html.Node to print all links in a HTML Doc


I am trying to print all links in a HTML Doc using a function which accepts *html.Node as an argument. I am new to both Golang and the *html.Node datatype, I have never worked with them before.

func visit(links []string, n *html.Node) []string {
    if n == nil {
        return links
    }
    if n.Type == html.ElementNode && n.Data == "a" {
        for _, a := range n.Attr {
            if a.Key == "href" {
                links = append(links, a.Val)
            }
        }
    }
    if i == 0 {
        i++
        return visit(links, n.FirstChild)
    }
    return visit(links, n.NextSibling)
}

The aim of the if block which checks whether i==0 is to ensure that return visit(links, n.FirstChild) only runs once (the first time) and return visit(links, n.NextSibling) runs on the following iterations. However, links never gets appended, and always returns an empty slice. I do not understand the error in my code.

The code works perfectly fine when using a for loop, but breaks when I try to use recursion.

for c := n.FirstChild; c != nil; c = c.NextSibling {
        links = visit(links, c)
    }

Solution

  • Your code doesn't work as in it takes the first child of the document, which is html element and then it takes its sibling element which is nil thus causing the function to end with empty slice of links.

    To explain in detail: Here's an example code,

    package main
    
    import (
        "fmt"
        "log"
        "strings"
    
        "golang.org/x/net/html"
    )
    
    var i int = 0
    
    func visit(links []string, n *html.Node) []string {
    
        if n == nil {
            return links
        }
    
        if n.Type == html.ElementNode && n.Data == "a" {
            for _, a := range n.Attr {
                if a.Key == "href" {
                    links = append(links, a.Val)
                }
            }
        }
    
        if i == 0 {
            i++
            return visit(links, n.FirstChild)
        }
    
        return visit(links, n.NextSibling)
    }
    
    func main() {
        s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>`
    
        doc, err := html.Parse(strings.NewReader(s))
        if err != nil {
            log.Fatal(err)
        }
    
        links := visit([]string{}, doc)
    
        fmt.Println(links)
    }
    

    1st call to visit,
    Arguments:
    links = []
    n = DocumentNode

    In the first call, i=0, so it makes a recursive call to visit with the first child of the document node.

    2nd call to visit,
    Arguments:
    links = []
    n = ElementNode (n.Data = "html")

    In the 2nd call, n is the html element node. Now a 3rd call to visit is made with the next sibling of the html element node. And this is where the problem lies. There is no sibling for html element node, so n will be nil.

    3rd call to visit,
    Arguments:
    links = []
    n = nil

    So, now all the function 3 function calls that were called recursively will return and the flow of execution will go back to main and hence the slice links will remain empty.

    Hope you understood.

    The proper way to code this functionality is by the loop you shared in your question, like this,

    package main
    
    import (
        "fmt"
        "log"
        "strings"
    
        "golang.org/x/net/html"
    )
    
    func visit(links []string, n *html.Node) []string {
    
        if n.Type == html.ElementNode && n.Data == "a" {
            for _, a := range n.Attr {
                if a.Key == "href" {
                    links = append(links, a.Val)
                }
            }
        }
    
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            links = visit(links, c)
        }
    
        return links
    }
    
    func main() {
        s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>`
    
        doc, err := html.Parse(strings.NewReader(s))
        if err != nil {
            log.Fatal(err)
        }
    
        links := visit([]string{}, doc)
    
        fmt.Println(links)
    }
    

    Here, the loop helps to recursively find the links by checking each and every HTML element's children. If one of the HTML elements have no siblings, then it would simply move on to its parent's next sibling element and check