ruby-on-rails, ruby, screen-scraping, nokogiri

Check if Nokogiri HTML document is usable


I want to check if the URL that the user inputs is in fact a valid page.

I tried:

if Nokogiri::HTML(open("http://example.com"))
  #DO REQUIRED TASK
end

But that immediately throws an error upon attempting to open the page. I want to return the result of whether it is a document of any kind.

I either get the error:

no such file or directory

or:

getaddrinfo: Name or service not known

depending on how I try to make the check.


Solution

  • I'd start with something like:

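    require 'nokogiri'
    require 'open-uri'

    url = 'http://example.com' # replace with the user-supplied URL

    begin
      # URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
      doc = Nokogiri::HTML(URI.open(url))
    rescue StandardError => e
      # Covers DNS failures (SocketError), HTTP errors, missing files, etc.
      puts "Couldn't read \"#{url}\": #{e}"
      exit
    end

    puts doc.errors.empty? ? "No problems found" : doc.errors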
    

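    Nokogiri fills the document's errors array with any errors it encountered while parsing. Because the HTML parser recovers from malformed markup, a non-empty array does not necessarily mean the document is unusable.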

    This only addresses one part of the issue though. Malicious people like to break things, and this would be very easy to break. In general, be very careful about anything a user gives you, especially if your site is exposed to the wild internet.

    Before telling OpenURI to load the URL for Nokogiri, you should sanity-check it: send an HTTP HEAD request to find out the size (Content-Length) and MIME type (Content-Type) of the content being retrieved. Once you know those are acceptable, you can try loading the page.
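
    The HEAD check described above could be sketched roughly as follows. This is a minimal sketch using Ruby's standard Net::HTTP, not a hardened implementation. The method names acceptable_headers? and fetchable_html?, the 2 MiB size cap, the 5-second timeouts, and the list of accepted MIME types are all assumptions chosen for illustration.

```ruby
require 'net/http'
require 'uri'

MAX_BYTES  = 2 * 1024 * 1024 # hypothetical size cap (2 MiB)
HTML_TYPES = %w[text/html application/xhtml+xml].freeze # assumed whitelist

# Pure check on header values, so it can be exercised without a network.
def acceptable_headers?(content_type, content_length)
  # Drop parameters such as "; charset=utf-8" before comparing.
  mime = content_type.to_s.split(';').first.to_s.strip.downcase
  return false unless HTML_TYPES.include?(mime)

  # Content-Length is optional; when present it must be within the cap.
  content_length.nil? || content_length.to_i <= MAX_BYTES
end

# Sends a HEAD request and returns true if the response looks like a
# reasonably sized HTML page. Any network or parse failure yields false.
def fetchable_html?(url)
  uri = URI.parse(url)
  # Accept only http and https URLs (URI::HTTPS is a subclass of URI::HTTP).
  return false unless uri.is_a?(URI::HTTP) && uri.host

  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: 5, read_timeout: 5) do |http|
    http.head(uri.request_uri)
  end

  response.is_a?(Net::HTTPSuccess) &&
    acceptable_headers?(response['content-type'], response['content-length'])
rescue StandardError
  false
end
```

    Redirects are not followed here, so a 301 or 302 response is treated as unusable. Servers may also omit or misreport Content-Length, so it is still worth enforcing a limit while actually reading the body.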