
HTML Scraping with Hpricot (Using Ruby on Rails)


I have read a good number of tutorials to get to grips with Hpricot, but the problem I keep running into is that it does not seem to scrape all of the HTML. I'll elaborate:

The website I am attempting to scrape HTML from is http://yellowpages.com.mt/Malta-Search/Radio-In-Malta-Gozo.aspx .

I need to obtain the links that are listed as results. (I need to do this for potentially any URL on that site, so RSS or similar is no help: the program has to read the links on the fly from whatever URL I feed it.)

I have tried everything to pull out the specific ID I need (giving the direct XPath, and so on), but I realised that when I do

    doc = Hpricot(open("http://yellowpages.com.mt/Malta-Search/Radio-In-Malta-Gozo.aspx", 'User-Agent'=>'ruby'))
    str = doc
    puts str

the output excludes all the HTML related to the links I need! So whichever method I use to scrape, it does not find the required elements, because according to Hpricot they are not there.

When I view the source in Firefox, however, I do see them, so I'm very confused. Does anyone know how to get around this issue? I have been at this for ages and can't find a solution on my own. Any help would be highly appreciated.


Solution

  • It looks like the site is doing something with the User-Agent. If I change that property to match what my version of Firefox sends, I get the full response body. When I left the property as 'ruby', the response was incomplete. Not sure what the root cause is, but this seemed to alleviate the symptoms.

    require 'rubygems'
    require 'hpricot'
    require 'open-uri'
    
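    # Send a browser-like User-Agent; with 'ruby' the response body came back incomplete.
    url = "http://yellowpages.com.mt/Malta-Search/Radio-In-Malta-Gozo.aspx"
    user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2'
    doc = open(url, 'User-Agent' => user_agent) { |f| Hpricot(f) }
    # Print every <h6> element in the parsed document.
    puts doc.search('h6')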
    

    Hope this helps!
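
If you want to confirm that the User-Agent really is what changes the response, one rough way is to fetch the page under each header and compare the raw bodies before any parsing happens. The following is only a sketch using Ruby's standard library (no Hpricot): the helper names `fetch_body` and `compare_bodies` are made up for illustration, and counting occurrences of `<h6` is just an assumed stand-in for "the result markup is present", based on the `h6` search above.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetch the raw response body for a URL,
# sending the given string as the User-Agent request header.
def fetch_body(url, user_agent)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request['User-Agent'] = user_agent
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  response.body
end

# Hypothetical helper: for each labelled body, report its size in bytes
# and how many times the marker string (e.g. "<h6") occurs in it.
def compare_bodies(bodies, marker)
  bodies.map do |label, body|
    [label, { bytes: body.bytesize, hits: body.scan(marker).size }]
  end.to_h
end

# Example usage (performs real HTTP requests, so it is left commented out):
#   url = 'http://yellowpages.com.mt/Malta-Search/Radio-In-Malta-Gozo.aspx'
#   bodies = {
#     'ruby'    => fetch_body(url, 'ruby'),
#     'firefox' => fetch_body(url, 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2')
#   }
#   p compare_bodies(bodies, '<h6')
```

If the two entries differ noticeably in size or marker count, the server is varying its output by User-Agent, and the fix belongs in the request headers rather than in the Hpricot selectors.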