ruby xpath screen-scraping nokogiri hpricot

Screen scraping through nokogiri or hpricot


I'm trying to get the actual value of a given XPath. I have the following code in a sample.rb file:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.changebadtogood.com/'))
desc "Trying to get the value of given xpath"
task :sample do
  begin
    doc.xpath('//*[@id="view_more"]').each do |link|
      puts link.content
    end
  rescue Exception => e
    puts "error" 
  end
end

Output is:

View more issues ..

When I try to get the value for a different XPath, such as:
/html/body/div[4]/div[3]/h1/span then I get the "error" message.

I tried this in Nokogiri. I don't know why it gives a result for only a few XPaths.

I tried the same in Hpricot.
http://hpricot.com/demonstrations

I pasted my URL and XPaths there, and I see the result for
//*[@id="view_more"]
as
View more issues ..
[This text is present at the bottom of the recent issues header]

But it shows no result for:
/html/body/div[4]/div[3]/h1/span
For this XPath I expect the result Bad.
[This is present at http://www.changebadtogood.com/ as the first header of the class="hero-unit" div.]


Solution

  • Your problem has to do with a poor XPath selector, and is unrelated to Nokogiri or Hpricot. Let's investigate:

    irb:01:0> require 'nokogiri'; require 'open-uri'
    #=> true
    irb:02:0> doc = Nokogiri::HTML(open('http://www.changebadtogood.com/')); nil
    #=> nil
    irb:03:0> doc.xpath('//*[@id="view_more"]').each{ |link| puts link.content }
    View more issues ..
    #=> 0
    irb:04:0> doc.at('#view_more').text  # Simpler version of the above.
    #=> "View more issues .."
    irb:05:0> doc.xpath('/html/body/div[4]/div[3]/h1/span')
    #=> []
    irb:06:0> doc.xpath('/html/body/div[4]')
    #=> []
    irb:07:0> doc.xpath('/html/body/div').length
    #=> 2
    

    From this we can see that there are only two divs that are children of the <body> element, and so div[4] fails to select one.

    It appears that you're trying to select the span here:

    <h1 class="landing_page_title">
      Change <span style='color: #808080;'>Bad</span> To Good
    </h1>
    

    Instead of relying on the fragile markup leading up to this (indexing anonymous hierarchies of elements), use the semantic structure of the document to your advantage for a selector that is both simpler and more robust. Using either CSS or XPath syntax:

    irb:08:0> doc.at('h1.landing_page_title > span').text
    #=> "Bad"
    irb:09:0> doc.at_xpath('//h1[@class="landing_page_title"]/span').text
    #=> "Bad"