I am trying to print all links in a HTML Doc using a function which accepts *html.Node as an argument. I am new to both Golang and the *html.Node datatype, I have never worked with them before.
func visit(links []string, n *html.Node) []string {
if n == nil {
return links
}
if n.Type == html.ElementNode && n.Data == "a" {
for _, a := range n.Attr {
if a.Key == "href" {
links = append(links, a.Val)
}
}
}
if i == 0 {
i++
return visit(links, n.FirstChild)
}
return visit(links, n.NextSibling)
}
The aim of the if block which checks whether i==0
is to ensure that return visit(links, n.FirstChild)
only runs once (the first time) and return visit(links, n.NextSibling)
runs on the following iterations. However, links
never gets appended, and always returns an empty slice. I do not understand the error in my code.
The code works perfectly fine when using a for loop, but breaks when I try to use recursion.
for c := n.FirstChild; c != nil; c = c.NextSibling {
links = visit(links, c)
}
Your code doesn't work as in it takes the first child of the document, which is html
element and then it takes its sibling element which is nil
thus causing the function to end with empty slice of links.
To explain in detail: Here's an example code,
package main
import (
"fmt"
"log"
"strings"
"golang.org/x/net/html"
)
var i int = 0
func visit(links []string, n *html.Node) []string {
if n == nil {
return links
}
if n.Type == html.ElementNode && n.Data == "a" {
for _, a := range n.Attr {
if a.Key == "href" {
links = append(links, a.Val)
}
}
}
if i == 0 {
i++
return visit(links, n.FirstChild)
}
return visit(links, n.NextSibling)
}
func main() {
s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>`
doc, err := html.Parse(strings.NewReader(s))
if err != nil {
log.Fatal(err)
}
links := visit([]string{}, doc)
fmt.Println(links)
}
1st call to visit
,
Arguments:
links = []
n = DocumentNode
In the first call, i=0, so it makes a recursive call to visit
with the first child of the document node.
2nd call to visit
,
Arguments:
links = []
n = ElementNode (n.Data = "html")
In the 2nd call, n
is the html
element node. Now a 3rd call to visit
is made with the next sibling of the html
element node.
And this is where the problem lies.
There is no sibling for html
element node, so n
will be nil
.
3rd call to visit
,
Arguments:
links = []
n = nil
So, now all the function 3 function calls that were called recursively will return and the flow of execution will go back to main
and hence the slice links
will remain empty.
Hope you understood.
The proper way to code this functionality is by the loop you shared in your question, like this,
package main
import (
"fmt"
"log"
"strings"
"golang.org/x/net/html"
)
func visit(links []string, n *html.Node) []string {
if n.Type == html.ElementNode && n.Data == "a" {
for _, a := range n.Attr {
if a.Key == "href" {
links = append(links, a.Val)
}
}
}
for c := n.FirstChild; c != nil; c = c.NextSibling {
links = visit(links, c)
}
return links
}
func main() {
s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>`
doc, err := html.Parse(strings.NewReader(s))
if err != nil {
log.Fatal(err)
}
links := visit([]string{}, doc)
fmt.Println(links)
}
Here, the loop helps to recursively find the links by checking each and every HTML element's children. If one of the HTML elements have no siblings, then it would simply move on to its parent's next sibling element and check