
How do I extract text from a web page with <br /> tags using Hpricot?


I'm trying to parse an HTML file using Hpricot and Ruby, but I'm having trouble extracting "free-floating" text that is not enclosed in tags such as <p></p>.

require 'hpricot'

text = <<SOME_TEXT
  <a href="http://www.somelink.com/foo/bar.html">Testing:</a><br />
  line 1<br />  
  line 2<br />
  line 3<br />
  line 4<br />
  line 5<br />
  <b>Here's some more text</b>
SOME_TEXT

parsed = Hpricot(text)

parsed = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first.following_siblings
puts parsed

I would expect the result to be

<br />
line 1<br />  
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>

But I am getting

<br />
<br />
<br />
<br />
<br />
<br />
<b>Here's some more text</b>

How can I make Hpricot return line 1, line 2, and so on?


Solution

  • Your first step is to read the following_siblings documentation:

    Find sibling elements which follow the current one. Like the other “sibling” methods, this weeds out text and comment nodes.

    Next, look at the Hpricot source to see how following_siblings is implemented, and write a variant that does the same thing without filtering out non-container nodes such as text:

    parsed        = Hpricot(text)
    link          = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first
    link_sibs     = link.parent.children
    what_you_want = link_sibs[link_sibs.index(link) + 1 ... link_sibs.length]
    
    puts what_you_want
    

    That's essentially following_siblings with parent.children in place of parent.containers, so text nodes are kept. Having access to the source code of the libraries you use is handy, and studying it is worth the effort.
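
To see why slicing the unfiltered child list keeps the text, here is a minimal sketch that runs without Hpricot. It uses plain Ruby objects as stand-ins for parsed nodes. The names Node, CHILDREN, and siblings_after are hypothetical and are not part of Hpricot's API; only the slicing idea is taken from the answer above.

```ruby
# Stand-in for a parsed node. This is NOT an Hpricot class, just an illustration.
Node = Struct.new(:kind, :value) do
  # Rough analogue of a "container": an element rather than a text node.
  def container?
    kind == :element
  end

  def to_s
    value
  end
end

# Children of a hypothetical parent, in document order.
CHILDREN = [
  Node.new(:element, '<a>Testing:</a>'),
  Node.new(:element, '<br />'),
  Node.new(:text,    'line 1'),
  Node.new(:element, '<br />'),
  Node.new(:text,    'line 2'),
  Node.new(:element, '<b>more text</b>')
].freeze

# Return every node that comes after `start` in `nodes`,
# using the same index-and-slice approach as the answer.
def siblings_after(nodes, start)
  nodes[(nodes.index(start) + 1)..-1]
end

link = CHILDREN.first

# Filtering to containers before slicing loses the text nodes.
elements_only = siblings_after(CHILDREN.select(&:container?), link)

# Slicing the full child list keeps them.
all_nodes = siblings_after(CHILDREN, link)

puts elements_only.map(&:to_s).inspect
puts all_nodes.map(&:to_s).inspect
```

The first list contains only the tags, which matches the behaviour described in the question; the second also contains "line 1" and "line 2". With real Hpricot nodes the same slice applies to link.parent.children, as shown in the answer.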