ruby, xpath, curb

How to go through an array of URLs using Curb


I need to parse this page, https://www.petsonic.com/snacks-huesos-para-perros/, and retrieve information about every item (name, price, image, etc.). The problem is that I don't know how to go through an array of URLs. If I were using 'open-uri', I would do something like this:

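require 'nokogiri'
require 'open-uri'

page = "https://www.petsonic.com/snacks-huesos-para-perros/"

# Kernel#open no longer accepts URLs on Ruby 3.0+, so call URI.open explicitly.
doc = Nokogiri::HTML(URI.open(page))
links = doc.xpath('//a[@class="product-name"]/@href')

links.each do |link|
  # Each result is an attribute node; .value gives the URL as a string.
  doc2 = Nokogiri::HTML(URI.open(link.value))
  text = doc2.xpath('//a[@class="product-name"]').text
  puts text
end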

However, I am only allowed to use 'Curb', and that is what confuses me.


Solution

  • You can use the curb gem

    gem install curb
    

    Then, in your Ruby script:

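    require 'curb'
    page = "https://www.petsonic.com/snacks-huesos-para-perros/"
    # Download the listing page as a string
    str = Curl.get(page).body
    # Keep the contents of every <a> tag that carries class="product-name..."
    links = str.scan(/<a(.*?)<\/a\>/).flatten.select{|l| l[/class\=\"product-name/]}
    # Take everything after the first ">" as the link's inner text
    inner_text_of_links = links.map{|l| l[/(?<=>).*/]}
    puts inner_text_of_links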
    

    The hard part is the regex, so let's break it down. To get the links, we scan the string for <a> tags. Because the pattern contains a capture group, scan returns an array of one-element arrays, which flatten then turns into a single flat array of strings.

    str.scan(/<a(.*?)<\/a\>/)
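
To see what scan and flatten do at this step, here is a minimal sketch on a made-up HTML snippet. The sample_html string and the anchor_bodies helper are hypothetical names used only for illustration; they are not part of the page or of Curb. Two caveats: `.` does not match newlines unless you add the `m` flag, so an <a> tag that spans several lines is skipped, and `<a` also matches the start of tags such as <abbr>.

```ruby
# Return the text between "<a" and "</a>" for every anchor found in html.
def anchor_bodies(html)
  nested = html.scan(/<a(.*?)<\/a>/) # one single-element array per match
  nested.flatten                     # collapse into a flat array of strings
end

# Hypothetical stand-in for the downloaded page body.
sample_html = '<li><a href="/p/1" class="product-name">Bone</a></li>' \
              '<li><a href="/about" class="nav">About</a></li>'

p sample_html.scan(/<a(.*?)<\/a>/) # nested arrays, before flatten
p anchor_bodies(sample_html)       # flat array, after flatten
```

Each string in the result still contains the tag's attributes followed by its inner text, which is why the later steps can filter on the class and then cut at the first ">".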
    

    Then we select only the items that contain the class you specified, product-name.

    .select{|l| l[/class\=\"product-name/]}
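
The filtering step can be sketched on its own like this. The with_class helper and the sample fragments are hypothetical; the idea is the same as in the select call above, since String#[] with a regex returns nil when nothing matches, so only matching fragments survive.

```ruby
# Keep only the anchor fragments that carry the wanted class attribute.
# Like the original pattern, this is a prefix match: there is no closing
# quote, so class="product-name-large" would be kept as well.
def with_class(fragments, css_class)
  pattern = /class="#{Regexp.escape(css_class)}/
  fragments.select { |fragment| fragment[pattern] }
end

# Hypothetical fragments, shaped like the output of scan + flatten.
fragments = [
  ' href="/p/1" class="product-name">Bone',
  ' href="/about" class="nav">About'
]

p with_class(fragments, 'product-name')
```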
    

    Now, to get the inner text of each tag, we map over the links with a look-behind regex that returns everything after the first ">":

    inner_text_of_links = links.map{|l| l[/(?<=>).*/]}
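
The same look-behind idea can pull the href out of each tag instead of its inner text, which is what the question needs in order to visit every product page. Below is a rough sketch under a few assumptions: product_links, each_product_page, fetch_with_curb and the fetcher parameter are hypothetical names, the hrefs are double-quoted, and Curl.get(url).body is used exactly as in the script above. Unlike the original pattern, this one uses `<a\s` and the `m` flag so that tags spanning several lines are also picked up. A real HTML parser would be more robust than regexes if you are allowed to use one.

```ruby
require 'uri'

# Fetch a URL body with Curb. The gem is loaded lazily so the parsing
# helpers below can be tried without it installed.
def fetch_with_curb(url)
  require 'curb'
  Curl.get(url).body
end

# Return the href of every <a> tag whose class starts with css_class.
def product_links(html, css_class = 'product-name')
  tags = html.scan(/<a\s(.*?)<\/a>/m).flatten
  wanted = tags.select { |tag| tag[/class="#{Regexp.escape(css_class)}/] }
  # Look-behind: take whatever follows href=" up to the closing quote.
  wanted.map { |tag| tag[/(?<=href=")[^"]*/] }.compact.uniq
end

# Fetch the listing page, then yield each product URL with its body.
# fetcher is any object responding to call(url); it defaults to Curb.
def each_product_page(listing_url, fetcher: method(:fetch_with_curb))
  product_links(fetcher.call(listing_url)).each do |href|
    url = URI.join(listing_url, href).to_s # also resolves relative links
    yield url, fetcher.call(url)
  end
end

# Usage (needs network access and the curb gem):
#   page = 'https://www.petsonic.com/snacks-huesos-para-perros/'
#   each_product_page(page) { |url, body| puts "#{url}: #{body.length} bytes" }
```

Inside the block you would then extract the name, price and image from body, for example with further scan calls. Passing a lambda as fetcher makes it possible to try the loop on canned HTML without hitting the site.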