ruby nokogiri open-uri

Iterating through multiple URLs to parse HTML with Nokogiri


What I'm trying to do is scrape the names and prices of items from multiple vendors using Nokogiri. I'm passing the CSS selectors (used to find the names and prices) to Nokogiri as method arguments.

Any guidance on how to pass multiple URLs to the "scrape" method while also passing the other arguments (e.g., vendor, item_path)? Or am I going about this completely the wrong way?

Here is the code:

require 'rubygems' # Load Ruby Gems
require 'nokogiri' # Load Nokogiri
require 'open-uri' # Load Open-URI

@@collection = Array.new # Array to hold meta hash

def scrape(url, vendor, item_path, name_path, price_path)
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
        @@collection << meta = Hash.new # Creates a new hash then add to global array
        meta[:vendor] = vendor
        meta[:name] = item.css(name_path).text.strip
        meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join 
    end
end

scrape( "page_a.html", "Sample Vendor A", "#products", ".title", ".prices")
scrape( ["page_a.html", "page_b.html"], "Sample Vendor B",  "#items", ".productname", ".price")

Solution

  • You can pass multiple URLs the same way you're already doing it in your second example:

    scrape( ["page_a.html", "page_b.html"], "Sample Vendor B",  "#items", ".productname", ".price")
    

    Your scrape method will then have to iterate through those URLs, for instance:

    def scrape(urls, vendor, item_path, name_path, price_path)
      urls.each do |url|
        doc = Nokogiri::HTML(open(url)) # Opens URL
        items = doc.css(item_path) # Sets items
        items.each do |item| # Iterates through each item on grid
          @@collection << meta = Hash.new # Creates a new hash then add to global array
          meta[:vendor] = vendor
          meta[:name] = item.css(name_path).text.strip
          meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join
        end
      end
    end
    

    This also means that the first example must be passed as an array as well:

    scrape( ["page_a.html"], "Sample Vendor A", "#products", ".title", ".prices")
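
If you would rather keep the single-URL call working unchanged, one option is to normalize the first argument with Ruby's `Kernel#Array`, which wraps a lone string in an array and returns an existing array as it is. The sketch below shows only that normalization step and uses no Nokogiri; `each_page` is a hypothetical helper name introduced for illustration, not part of the original code.

```ruby
# Hypothetical helper (not in the original code): yields each URL whether
# the caller passed a single string or an array of strings.
def each_page(urls)
  # Kernel#Array wraps a lone string in an array; an array is returned as-is.
  Array(urls).each { |url| yield url }
end

seen = []
each_page("page_a.html") { |url| seen << url }                   # single URL
each_page(["page_a.html", "page_b.html"]) { |url| seen << url }  # list of URLs
p seen # prints ["page_a.html", "page_a.html", "page_b.html"]
```

Applied to the method above, the outer loop would likely become `Array(urls).each do |url|`, so that both calls from the question work as originally written.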