Tags: ruby, nokogiri, open-uri

Getting all unique URLs using Nokogiri


I've been trying for a while to use the .uniq method to generate a unique list of URLs from a website (within the /informatics path). No matter what I try, I get a method error when generating the list. I'm sure it's a syntax issue, and I was hoping someone could point me in the right direction.

Once I have the list, I'll need to store the URLs in a database via ActiveRecord, but I need the unique list before I start to wrap my head around that.

require 'nokogiri'
require 'open-uri'
require 'active_record'

ARGV[0]="https://www.nku.edu/academics/informatics.html"

ARGV.each do |arg|
  open(arg) do |f|
    # Display connection data
    puts "#"*25 + "\nConnection: '#{arg}'\n" + "#"*25
    [:base_uri, :meta, :status, :charset, :content_encoding,
    :content_type, :last_modified].each do |method|
      puts "#{method.to_s}: #{f.send(method)}" if f.respond_to? method
    end

    # Display the href links
    base_url = /^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]
    puts "base_url: #{base_url}"

    Nokogiri::HTML(f).css('a').each do |anchor|
      href = anchor['href']
      # Make Unique

      if href =~ /.*informatics/
        puts href
        #store stuff to active record
      end
    end
  end
end

Solution

  • Replace the Nokogiri::HTML part so that it selects only the anchors whose href attribute matches /.*informatics/. select returns an Array, so once you map the anchors to their href values you can call uniq on the result:

    require 'nokogiri'
    require 'open-uri'
    require 'active_record'
    
    ARGV[0] = 'https://www.nku.edu/academics/informatics.html'
    
    ARGV.each do |arg|
      # On Ruby 3.0+ open-uri no longer patches Kernel#open, so call URI.open
      URI.open(arg) do |f|
        puts "#{'#' * 25}\nConnection: '#{arg}'\n#{'#' * 25}"
    
        %i[base_uri meta status charset content_encoding content_type last_modified].each do |method|
          puts "#{method.to_s}: #{f.send(method)}" if f.respond_to? method
        end
    
        puts "base_url: #{/^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]}"
    
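        # Keep only the anchors whose href matches; select returns an Array
        anchors = Nokogiri::HTML(f).css('a').select { |anchor| anchor['href'] =~ /.*informatics/ }
        # Map the anchors to their href strings and drop duplicates
        puts anchors.map { |anchor| anchor['href'] }.uniq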
      end
    end
    

    See output.
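One thing to watch for: uniq compares the href strings exactly as written, so a relative link and an absolute link to the same page would both stay in the list. If that matters for what you store later, here is a minimal sketch of one way to normalize first, using only the standard uri library. The helper names unique_informatics_urls and resolve are hypothetical, and the sample hrefs array is made up to stand in for what Nokogiri::HTML(f).css('a').map { |a| a['href'] } would return. It also assumes you want to ignore #fragments and match "informatics" in the path only, which is a bit stricter than the regex above.

```ruby
require 'uri'

# Hypothetical helper (not part of the answer above): resolves each href
# against the page URL, keeps the ones whose path contains "informatics",
# and removes duplicates.
def unique_informatics_urls(hrefs, page_url)
  base = URI.parse(page_url)

  hrefs
    .compact                              # anchors without an href give nil
    .map { |href| resolve(base, href) }
    .compact                              # drop hrefs that could not be parsed
    .select { |uri| uri.path.to_s.include?('informatics') }
    .map(&:to_s)
    .uniq
end

# Returns an absolute URI with the fragment removed, or nil if the href
# cannot be parsed as a URI.
def resolve(base, href)
  uri = URI.join(base, href.strip)
  uri.fragment = nil
  uri
rescue URI::Error
  nil
end

# Made-up sample data standing in for the hrefs scraped from the page.
hrefs = [
  '/academics/informatics.html',
  'https://www.nku.edu/academics/informatics.html',
  '/academics/informatics.html#main',
  '/academics/informatics/programs.html',
  '/admissions.html',
  nil
]

puts unique_informatics_urls(hrefs, 'https://www.nku.edu/academics/informatics.html')
```

With the sample data, the first three hrefs collapse into a single absolute URL, so only two URLs are printed. Whether stripping fragments is right depends on your use case, so treat that part as optional.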