I'm trying to parse an HTML file using Hpricot and Ruby, but I'm having trouble extracting "free-floating" text that is not enclosed in tags such as <p></p>.
require 'hpricot'
text = <<SOME_TEXT
<a href="http://www.somelink.com/foo/bar.html">Testing:</a><br />
line 1<br />
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>
SOME_TEXT
parsed = Hpricot(text)
parsed = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first.following_siblings
puts parsed
I would expect the result to be:
<br />
line 1<br />
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>
But I am getting:
<br />
<br />
<br />
<br />
<br />
<br />
<b>Here's some more text</b>
How can I make Hpricot return the text nodes (line 1, line 2, etc.) as well?
Your first step is to read the following_siblings documentation:
Find sibling elements which follow the current one. Like the other “sibling” methods, this weeds out text and comment nodes.
Then you can use the Hpricot source as a guide to write something that works like following_siblings but doesn't filter out non-container nodes:
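parsed = Hpricot(text)
link = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first
# children keeps text and comment nodes, so "line 1" etc. are included
link_sibs = link.parent.children
# slice out every node that comes after the link
what_you_want = link_sibs[link_sibs.index(link) + 1 ... link_sibs.length]
puts what_you_want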
That's pretty much following_siblings with parent.children instead of parent.containers. Having access to the source code of the libraries you use is pretty handy, and studying it is to be encouraged.
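If you want to see why swapping the list matters without installing Hpricot, here is a minimal sketch of the same slicing logic in plain Ruby. Everything in it is a stand-in: `Node`, `container?` and `nodes_after` are hypothetical names made up for this illustration, not Hpricot classes or methods, and the child list is a shortened, hand-written approximation of what a parser would produce for the sample markup.

```ruby
# Stand-in node type (NOT an Hpricot class): an element or a bare text node.
Node = Struct.new(:kind, :content) do
  # Elements count as "containers"; text nodes do not.
  def container?
    kind == :elem
  end

  def to_s
    content
  end
end

# Hand-written approximation of the parent's child list (shortened to two lines).
children = [
  Node.new(:elem, '<a href="http://www.somelink.com/foo/bar.html">Testing:</a>'),
  Node.new(:elem, '<br />'),
  Node.new(:text, 'line 1'),
  Node.new(:elem, '<br />'),
  Node.new(:text, 'line 2'),
  Node.new(:elem, '<br />'),
  Node.new(:elem, "<b>Here's some more text</b>")
]

# Hypothetical helper: every entry of +siblings+ that comes after +node+.
# Identity comparison is used because the <br /> structs are all equal by value.
def nodes_after(siblings, node)
  start = siblings.index { |s| s.equal?(node) } + 1
  siblings[start...siblings.length]
end

link = children.first

# Slicing a containers-only list loses the text, as following_siblings does.
filtered = nodes_after(children.select(&:container?), link)

# Slicing the full child list keeps the text nodes.
everything = nodes_after(children, link)

puts everything
```

The slice is identical in both cases; the only thing that changes is which list it is taken from, which is the whole difference between the two approaches.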