
hpricot: get image from URL and parse element


I am trying to get the exact URL of an image inside a page and then download it. I haven't gotten to the download step yet, because I am still trying to isolate the image URL. Here is the code:

#!/usr/bin/ruby -w

require 'rubygems'
require 'hpricot'
require 'open-uri'

raw = Hpricot(open("http://www.amazon.com/Weezer/dp/B000003TAW/"))
ele = raw.search("img[@src*=jpg]").first
img = ele.match("(\")(.*?)(\")").captures
puts img[1]

When I run it as is, I get:

undefined method `match' for #<Hpricot::Elem:0xb731948c> (NoMethodError)

If I comment out the last two lines and add

puts ele

I get:

<img src="http://ecx.images-amazon.com/images/I/51rpVNqXmYL._SL500_AA240_.jpg" style="display:none;" />

which is the correct portion of the page I want to parse. However, the error occurs when I try to extract just the "http://ecx.images-amazon.com/images/I/51rpVNqXmYL._SL500_AA240_.jpg" part.

I am not sure why it can't perform a match. As I understand it, the search I am running should get an array of the image elements and return the first one. I assumed that I could not run the match on the entire array, so I tried

img = ele[1].match("(\")(.*?)(\")").captures
puts img

and that returns

undefined method `match' for nil:NilClass (NoMethodError)

I am lost. Please excuse my ignorance, as I am just beginning to learn Ruby. Any help is appreciated.


Solution

  • Change this line:

    img = ele.match("(\")(.*?)(\")").captures
    

    To:

    img = ele[:src]
    

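    The reason for the errors is that ele is a single Hpricot::Elem, not an array and not a String. Calling [] on an element looks up an attribute (which is why ele[:src] works), so ele[1] returns nil. And because the element isn't a String, it has no match method. Try: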

    ele.respond_to? :match
    

    and you get false.

    However, you could do:

    ele.to_s.match("(\")(.*?)(\")").captures[1]
    

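    The secret is in the to_s: it converts the element to its HTML string (the same text that puts ele prints), and a String does have a match method.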
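
To see what the regex does once you have a plain string, here is a minimal sketch that skips Hpricot and the network entirely. It assumes the element's HTML is exactly the line printed by puts ele in the question; the variable names (html, url_from_captures, url_from_src) are only for illustration. It also shows a narrower pattern tied to the src attribute, which should be less fragile if the attribute order in the tag ever changes.

```ruby
# The element's HTML, copied from the `puts ele` output in the question.
# With Hpricot this string would come from ele.to_s.
html = '<img src="http://ecx.images-amazon.com/images/I/51rpVNqXmYL._SL500_AA240_.jpg" style="display:none;" />'

# The original pattern has three groups: opening quote, contents, closing quote.
# String#match accepts a string pattern and converts it to a Regexp.
captures = html.match("(\")(.*?)(\")").captures
url_from_captures = captures[1] # the contents of the first quoted value

# A narrower alternative: anchor on the src attribute and ask for group 1.
url_from_src = html[/src="([^"]*)"/, 1]

puts url_from_captures
puts url_from_src
```

Both approaches work here only because src is the first quoted attribute in the tag. With the real element, ele[:src] is still the simplest option, since it needs no regex at all.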