
Nokogiri parsing cut content between elements


I've googled half the internet looking for help with my case.

So, what I need:

I have an HTML structure to parse, like this:

<div class="foo">
  <div class='bar' dir='ltr'>
    <div id='p1' class='par'>
      <p class='sb'>
        <span id='dc_1_1' class='dx'>
          <a href='/bar32560'>1</a>
        </span>
        Neque porro 
        <a href='/xyz' class='mr'>+</a>
        quisquam est 
        <a href='/xyz' class='mr'>+</a>
        qui. 
      </p>
    </div>
    <div id='p2' class='par'>
      <p class='sb'>
        <span id='dc_1_2' class='dx'>
          <a href='/foo12356'>2</a>
        </span>
        dolorem ipsum 
        <a href='/xyz' class='mr'>+</a>
        quia dolor sit amet, 
        <a href='/xyz' class='mr'>+</a>
        consectetur, adipisci velit.
      </p>
    </div>
    <div id='p3' class='par'>
      <p class='sb'>
        <span id='dc_1_3' class='dx'>
          <a href='/foobar4586'>3</a>
        </span>
        Neque porro quisquam 
        <a href='/xyz' class='mr'>+</a>
        est qui dolorem ipsum quia dolor sit 
        <a href='/xyz' class='mr'>+</a>
        amet, t.
        <a href='/xyz' class='mr'>+</a>
        <span id='dc_1_4' class='dx'>
          <a href='/barefoot4135'>4</a>
        </span>
        consectetur, 
        <a href='/xyz' class='mr'>+</a>
        adipisci veli.
        <span id='dc_1_5' class='dx'>
          <a href='/barfoo05123'>5</a>
       </span>
       Neque porro 
       <a href='/xyz' class='mr'>+</a>
       quisquam est
       <a href='/xyz' class='mr'>+</a>
       qui.
     </p>
   </div>
 </div>
</div>

What I need (in plain English): scrape each paragraph, but the final scraped text objects should have this form:

scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body 4 => 4 consectetur, adipisci veli.
scraped_body 5 => 5 Neque porro quisquam est qui.

The code I use for now:

page = Nokogiri::HTML(open(url))
x = page.css('.mr').remove
x.xpath("//div[contains(@class, 'par')]").map do |node|
  body = node.text
end

My result is like:

scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t. 4 consectetur, adipisci veli. 5 Neque porro quisquam est qui.

So this scrapes the whole text of each div with class 'par'. I need to scrape the text after each span together with the span's content (the number), or cut those divs before each span.

I need something like:

SPAN.text + P.text - a.mr

I don't know how to do this.

Please help me with this parsing. I guess I need to scrape after/before each span.

Please help, I've tried everything I could find.


EDIT: @Duck1337

I use the following code:

def verses
  page = Nokogiri::HTML(open(url))
  i=0
  x = page.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM").map do |node|
    i+=1
    body = node
    VerseSource.new(body, book_num, number, i)
  end
end

I need this because I am parsing a big website with text. There are a few more methods. So my final output looks like:

Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam est qui.

But if a single verse has multiple sentences, your code splits it at every sentence. So it splits too much.

For example:

    <div id='p1' class='par'>
      <p class='sb'>
        <span id='dc_1_3' class='dx'>
          <a href='/foobar4586'>1</a>
        </span>
        Neque porro quisquam. Est qui dolorem
        <a href='/xyz' class='mr'>+</a>
        <span id='dc_1_3' class='dx'>
          <a href='/foobar4586'>2</a>
        </span>
        est qui dolorem ipsum quia dolor sit. 
        <a href='/xyz' class='mr'>+</a>
        amet, t.

Your code splits it like this:

Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam.
Saved record with: book: 1, chapter: 1, verse: 2, body: Est qui dolorem
Saved record with: book: 1, chapter: 1, verse: 3, body: 2 est qui dolorem ipsum quia dolor sit.

Hope you see what I mean. Really big thanks to you for that. If you can modify this, it would be great!


EDIT: @KARDEIZ

Thanks for the answer! When I use your code inside my method, it parses really random stuff.

def verses
  page = Nokogiri::HTML(open(url))
  i=0
  #page.css(".mr").remove
  page.xpath("//div[contains(@class, 'par')]//span").map do |node|
    node.content.strip.tap do |out|
      while nn = node.next
        break if nn.name == 'span'
        out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
        node = nn
      end
    end
    i+=1
    body = node
    VerseSource.new(body, book_num, number, i)
  end
end

The output is like:

Saved record with: book: 1, chapter: 1, verse: 1, body:  <here is last part of last sentence in first paragraph after "+" sign(href) and before last "+"(href)>
Saved record with: book: 1, chapter: 1, verse: 2, body:  <here is last part of last sentence in second paragraph after "+" sign(href) and before last "+"(href)>
Saved record with: book: 1, chapter: 1, verse: 3, body:
Saved record with: book: 1, chapter: 1, verse: 4, body:
Saved record with: book: 1, chapter: 1, verse: 5, body:  <here is last sentence in third paragraph. It is after last "+" in this paragraph and have no more "+" signs(href)>

As you can see, I don't know how it makes such a mess ;] Can you do something more with that? Thanks a lot!


Regards!


Solution

  • Try something like:

    x.xpath("//div[contains(@class, 'par')]//span").map do |node|
      out = node.content.strip
      if following = node.at_xpath('following-sibling::text()')
        out << ' ' << following.content.strip
      end
      out
    end
    

    The following-sibling::text() XPath selects the text nodes that follow the span as siblings, and at_xpath returns only the first of them.

    EDIT

    I think this does what you want:

    html.xpath("//div[contains(@class, 'par')]//span").map do |node|
      node.content.strip.tap do |out|
        while nn = node.next
          break if nn.name == 'span'
          out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
          node = nn
        end
      end  
    end
    

    outputs:

    [
      "1 Neque porro quisquam est qui.",
      "2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.",
      "3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.",
      "4 consectetur, adipisci veli.",
      "5 Neque porro quisquam est qui."
    ]
    

    It's also possible to do this with pure XPath (see "XPath axis, get all following nodes until"), but this solution is simpler from a coding perspective.

    EDIT 2

    Try this:

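    def verses
      # On Ruby 3.0+, open-uri requires URI.open(url) instead of open(url).
      page = Nokogiri::HTML(open(url))
      i = 0
      page.xpath("//div[contains(@class, 'par')]//span").map do |node|
        # Start with the verse number from the span, then append the text
        # nodes that follow it, stopping at the next span.
        body = node.content.strip.tap do |out|
          while nn = node.next
            break if nn.name == 'span'
            # Only text nodes are appended, so the '+' links are skipped.
            out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
            node = nn
          end
        end
        i += 1
        VerseSource.new(body, book_num, number, i)
      end
    end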