Search code examples
gogoquery

goquery- Concatenate a tag with the one that follows


For some background info, I'm new to Go (3 or 4 days), but I'm starting to get more comfortable with it.

I'm trying to use goquery to parse a webpage. (Eventually I want to put some of the data in a database). For my problem, an example will be the easiest way to explain it:

<html>
    <body>
        <h1>
            <span class="text">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="text">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
<html>

I'd like to:

  1. Extract the content of <h1..."text".
  2. Insert (and concatenate) this extracted content into the content of <p..."text".
  3. Only do this for the <p> tag that immediately follows the <h1> tag.
  4. Do this for all of the <h1> tags on the page.

So this is what I want it to look like:

<html>
    <body>
        <p>
            <span class="text">Go totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <p>
            <span class="text">debugger should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle</span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
<html>

With the code starting off like this,

package main

import (
    "fmt"
    "strings"
    "github.com/PuerkitoBio/goquery"
)

func main() {
    html_code := strings.NewReader(`code_example_above`)
    doc, _ := goquery.NewDocumentFromReader(html_code)

I know that I can read <h1..."text" with:

h3_tag := doc.Find("h3 .text")

I also know that I can add the content of <h1..."text" to the content of <p..."text" with this:

doc.Find("p .text").Before("h3 .text")

^But this command inserts the content from every single case of <h1..."text" before every single case of <p..."text".

Then, I found out how to get a step closer to what I want:

doc.Find("p .text").First().Before("h3 .text")

^This command inserts the content from every single case of <h1..."text" only before the first case of <p..."text" (which is closer to what I want).

I also tried using goquery's Each() function, but I could not get any closer to what I wanted with that method (though I'm sure there's a way to do it with Each(), right?)

My biggest issue is that I can't figure out how to associate each instance of <h1..."text" with the <p..."text" instance that immediately follows it.

If it helps, <h1..."text" is always followed by <p..."text" on the web pages I'm trying to parse.

My brain's out of juice. Do any Go geniuses know how to do this and are willing to explain it? Thanks in advance.

EDIT

I found out something else I can do:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    nex := s.Next().Text()
    fmt.Println(s.Text(), nex, "\n\n")
})

^This prints out what I want--the contents of each instance of <h1..."text" followed by its immediate instance of <p..."text". I had thought that s.Next() would output the next instance of <h1>, but it outputs the next tag in doc--the *goquery.Selection that it's iterating through. Is that correct?

Or, as mattn pointed out, I could also use doc.Find("h1+p").

I'm still having trouble appending <h1..."text" to <p..."text". I'll post it as another question because you can break this one down into multiple questions, and Mattn already answered one.


Solution

  • I don't know what you are writing code with goquery. But maybe, your expected is neighbor selector.

    h1+p
    

    This returns h1 tags which has p tag in neighbor.