
Hpricot XML text search


Hpricot + Ruby XML parsing and logical selection.

Objective: Find all titles written by the author Bob.

My XML file:

<rss>
<channel>
<item>
<title>Book1</title>
<pubDate>march 1 2010</pubDate>
<author>Bob</author>
</item>

<item>
<title>book2</title>
<pubDate>october 4 2009</pubDate>
<author>Bill</author>
</item>

<item>
<title>book3</title>
<pubDate>June 5 2010</pubDate>
<author>Steve</author>
</item>
</channel>
</rss>

# My Hpricot code. Running it prints nothing, although the search pattern works on its own.
(doc % :rss % :channel / :item).each do |item|
  a = item.search("author[text()*='Bob']")

  # puts "FOUND" if a.include?("Bob")
  puts item.at("title") if a.include?("Bob")
end

Solution

  • One of the ideas behind XPath is that it lets us navigate a DOM much like a directory tree on disk:

    require 'hpricot'
    
    xml = <<EOT
    <rss>
        <channel>
            <item>
                <title>Book1</title>
                <pubDate>march 1 2010</pubDate>
                <author>Bob</author>
            </item>
    
            <item>
                <title>book2</title>
                <pubDate>october 4 2009</pubDate>
                <author>Bill</author>
            </item>
    
            <item>
                <title>book3</title>
                <pubDate>June 5 2010</pubDate>
                <author>Steve</author>
            </item>
    
            <item>
                <title>Book4</title>
                <pubDate>march 1 2010</pubDate>
                <author>Bob</author>
            </item>
    
        </channel>
    </rss>
    EOT
    
    doc = Hpricot(xml)
    
    titles = (doc / '//author[text()="Bob"]/../title' )
    titles # => #<Hpricot::Elements[{elem <title> "Book1" </title>}, {elem <title> "Book4" </title>}]>
    

    That means: "find every author element whose text is Bob, then go up one level and find the title tag".

    I added an extra book by "Bob" to test getting all occurrences.
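The same path can be tried without Hpricot. The following is only a minimal sketch that swaps in REXML (bundled with Ruby) for Hpricot, on the assumption that the XPath expression carries over unchanged; `FEED_XML` and `titles_by` are hypothetical names made up for this illustration, and the feed is a trimmed copy of the one above.

```ruby
require 'rexml/document'

# Trimmed copy of the sample feed (assumption: pubDate omitted for brevity).
FEED_XML = <<~EOT
  <rss>
    <channel>
      <item><title>Book1</title><author>Bob</author></item>
      <item><title>book2</title><author>Bill</author></item>
      <item><title>book3</title><author>Steve</author></item>
      <item><title>Book4</title><author>Bob</author></item>
    </channel>
  </rss>
EOT

# Hypothetical helper: text of every <title> whose sibling <author> is +name+.
# Note: +name+ is interpolated into the path, so it must not contain a
# double quote.
def titles_by(xml, name)
  doc = REXML::Document.new(xml)
  # Match the author, step up to the enclosing item with "..",
  # then step down to that item's title.
  path = %(//author[text()="#{name}"]/../title)
  REXML::XPath.match(doc, path).map(&:text)
end

puts titles_by(FEED_XML, 'Bob')
```

The `..` step is what makes this read like a directory path: the predicate picks the author, and the parent step moves back to the item that contains it.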

    To get the whole item for each book by Bob, stop one level up by dropping the final title step:

    items = (doc / '//author[text()="Bob"]/..' )
    puts items # => nil
    # >> <item>
    # >>             <title>Book1</title>
    # >>             <pubdate>march 1 2010</pubdate>
    # >>             <author>Bob</author>
    # >>         </item>
    # >> <item>
    # >>             <title>Book4</title>
    # >>             <pubdate>march 1 2010</pubdate>
    # >>             <author>Bob</author>
    # >>         </item>
    

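    I also figured out what (doc % :rss % :channel / :item) is doing. In Hpricot, % is shorthand for at (first match) and / is shorthand for search (all matches), so the expression is equivalent to nesting the searches, minus the wrapping parentheses. These should all be the same in Hpricot-ese: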

    (doc % :rss % :channel / :item).size # => 4
    (((doc % :rss) % :channel) / :item).size # => 4
    (doc / '//rss/channel/item').size # => 4
    (doc / 'rss channel item').size # => 4
    

    Because '//rss/channel/item' is how you'd normally see an XPath accessor, and 'rss channel item' is a CSS accessor, I'd recommend using those formats for maintenance and clarity.
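A spelled-out path also works when walking the items one by one, as the loop in the question tried to do. That loop presumably prints nothing because `a` is a collection of elements, so `include?("Bob")` compares element objects with a string and never matches. The sketch below shows the item-by-item approach with the comparison done on the author's text; it uses REXML (bundled with Ruby) rather than Hpricot, and `ITEMS_XML` and `titles_by_loop` are hypothetical names for this illustration only.

```ruby
require 'rexml/document'

# Trimmed copy of the sample feed (assumption: pubDate omitted for brevity).
ITEMS_XML = <<~EOT
  <rss>
    <channel>
      <item><title>Book1</title><author>Bob</author></item>
      <item><title>book2</title><author>Bill</author></item>
      <item><title>Book4</title><author>Bob</author></item>
    </channel>
  </rss>
EOT

# Hypothetical helper: loop over each item and keep the titles whose
# author text equals +author+.
def titles_by_loop(xml, author)
  doc = REXML::Document.new(xml)
  titles = []
  doc.elements.each('rss/channel/item') do |item|
    # Compare the author's text with the name, not the element itself.
    next unless item.elements['author'].text == author
    titles << item.elements['title'].text
  end
  titles
end

puts titles_by_loop(ITEMS_XML, 'Bob')
```

The single XPath expression in the answer is shorter, but the loop form can be easier to extend when each item needs more than one check.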