
Get an element that follows other elements with Hpricot and Ruby


I have the following HTML:

<ul class="filtering_new" width="50%">
     <li class="filter">1</li>
     <li class="filter">2</li>
     <script>Alert('1');</script>
     <li class="filter">3</li>
</ul>

How can I get the li whose inner_html is 3?

I tried this:

page.search("//ul.filtering_new").each do |list|
     puts list.search("li").size  
end

where page is the HTML document.

The size is 2, but it should be 3.

I tried to follow the manual at https://github.com/hpricot/hpricot/wiki/hpricot-challenge, but I cannot even find the <script> element.

 list.search("script")

returns nothing.


Solution

  • Most XML/HTML parsing in Ruby uses Nokogiri these days, so I'll recommend that parser. However, both Hpricot and Nokogiri support XPath and CSS selectors, so they are fairly interchangeable.

    I'd go about it this way:

    html = <<EOT
    <ul class="filtering_new" width="50%">
         <li class="filter">1</li>
         <li class="filter">2</li>
         <script>Alert('1');</script>
         <li class="filter">3</li>
    </ul>
    EOT
    
    require 'nokogiri'
    
    doc = Nokogiri::HTML(html)
    # Find every li with class "filter" via XPath, then keep those whose text is 3.
    li = doc.search('//li[@class="filter"]').select { |n| n.text.to_i == 3 }
    li # => [#<Nokogiri::XML::Element:0x8053fc84 name="li" attributes=[#<Nokogiri::XML::Attr:0x8053fb6c name="class" value="filter">] children=[#<Nokogiri::XML::Text:0x80546f98 "3">]>]
    

    That finds the candidate nodes with XPath and returns them as a NodeSet; select then iterates over the set in Ruby and keeps only the nodes whose text converts to 3.

    # Alternatively, let XPath do the text comparison.
    li = doc.search('//li[text() = "3"]')
    li # => [#<Nokogiri::XML::Element:0x8053fc84 name="li" attributes=[#<Nokogiri::XML::Attr:0x8053fb6c name="class" value="filter">] children=[#<Nokogiri::XML::Text:0x80546f98 "3">]>]
    

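    That offloads the comparison to the underlying libxml2 library, which is usually faster than filtering in Ruby. Note that text() = "3" is an exact match, so it will miss a node whose text has surrounding whitespace.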