goquery: Concatenate a tag with the one that follows

For some background info, I'm new to Go (3 or 4 days), but I'm starting to get more comfortable with it.

I'm trying to use goquery to parse a webpage (eventually I want to put some of the data in a database). An example is the easiest way to explain my problem:

            <span class="text">Go </span>
            <span class="text">totally </span>
            <span class="post">kicks </span>
            <span class="text">hacks </span>
            <span class="post">its </span>
            <span class="text">debugger </span>
            <span class="text">should </span>
            <span class="post">be </span>
            <span class="text">called </span>
            <span class="post">ogle </span>
            <span class="statement">true</span>

I'd like to:

  1. Extract the content of <h1..."text".
  2. Insert (and concatenate) this extracted content into the content of <p..."text".
  3. Only do this for the <p> tag that immediately follows the <h1> tag.
  4. Do this for all of the <h1> tags on the page.

So this is what I want it to look like:

            <span class="text">Go totally </span>
            <span class="post">kicks </span>
            <span class="text">hacks </span>
            <span class="post">its </span>
            <span class="text">debugger should </span>
            <span class="post">be </span>
            <span class="text">called </span>
            <span class="post">ogle</span>
            <span class="statement">true</span>

With the code starting off like this,

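package main

import (
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // The parse error is ignored here only to keep the snippet short.
    html_code := strings.NewReader(`code_example_above`)
    doc, _ := goquery.NewDocumentFromReader(html_code)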

I know that I can read <h1..."text" with:

h1_tag := doc.Find("h1 .text")

I also know that I can add the content of <h1..."text" to the content of <p..."text" with this:

doc.Find("p .text").Before("h1 .text")

^But this command inserts the content from every single case of <h1..."text" before every single case of <p..."text".

Then, I found out how to get a step closer to what I want:

doc.Find("p .text").First().Before("h1 .text")

^This command inserts the content from every single case of <h1..."text" only before the first case of <p..."text" (which is closer to what I want).

I also tried using goquery's Each() function, but I could not get any closer to what I wanted with that method (though I'm sure there's a way to do it with Each(), right?)

My biggest issue is that I can't figure out how to associate each instance of <h1..."text" with the <p..."text" instance that immediately follows it.

If it helps, <h1..."text" is always followed by <p..."text" on the web pages I'm trying to parse.

My brain's out of juice. Do any Go geniuses know how to do this and are willing to explain it? Thanks in advance.


I found out something else I can do:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    // s is the current <h1>; Next() selects the sibling element right after it.
    nex := s.Next().Text()
    fmt.Println(s.Text(), nex, "\n\n")
})

^This prints out what I want--the contents of each instance of <h1..."text" followed by its immediate instance of <p..."text". I had thought that s.Next() would output the next instance of <h1>, but it outputs the next tag in doc--the *goquery.Selection that it's iterating through. Is that correct?

Or, as mattn pointed out, I could also use doc.Find("h1+p").

I'm still having trouble appending <h1..."text" to <p..."text". I'll post that as a separate question, because this one really breaks down into several questions and mattn already answered one of them.


  • I'm not sure what you are trying to do with goquery, but maybe what you are looking for is the adjacent sibling selector, h1+p.

    It matches each p tag that immediately follows an h1 tag.