ruby http-status-code-404 nokogiri open-uri

404 not found, but can access normally from web browser


I tried many URLs on this and they seem to be fine until I came across this particular one:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html"))
puts doc

This is the result:

/Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:353:in `open_http': 404 Not Found (OpenURI::HTTPError)
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:709:in `buffer_open'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:210:in `block in open_loop'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:208:in `catch'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:208:in `open_loop'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:149:in `open_uri'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:689:in `open'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:34:in `open'
    from test.rb:5:in `<main>'  

I can access this from a web browser, I just don't get it at all.

What is going on, and how can I deal with this kind of error? Can I ignore it and let the rest do their work?


Solution

  • You're getting a 404 Not Found (OpenURI::HTTPError). If you want your code to continue past it, rescue that exception. Something like this should work:

    require 'nokogiri'
    require 'open-uri'
    
    URLS = %w[
      http://www.moxyst.com/fashion/men-clothing/underwear.html
    ]
    
    URLS.each do |url|
      begin
        # On Ruby 3.0+, call URI.open(url); Kernel#open no longer handles URLs there.
        doc = Nokogiri::HTML(open(url))
      rescue OpenURI::HTTPError => e
        puts "Can't access #{ url }"
        puts e.message
        puts
        next
      end
      puts doc.to_html
    end
    

    You can rescue more generic exceptions, but then you risk catching unrelated errors and handling them in a way that hides the real problem or produces confusing output, so decide how fine-grained your rescue needs to be.
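
    As a rough illustration of that granularity, the sketch below rescues a short, explicit list of network-related exceptions and lets everything else propagate. The helper name try_fetch and the particular list of exceptions are assumptions made for this example, not something the original code requires.

```ruby
require 'open-uri'
require 'socket'
require 'timeout'

# Hypothetical helper: runs the given block for one URL and returns its
# result, or nil if one of the listed network errors was raised.
# Anything not listed (for example a NoMethodError caused by a bug in
# the block) is deliberately NOT rescued, so it still surfaces.
def try_fetch(url)
  yield url
rescue OpenURI::HTTPError, SocketError, Timeout::Error => e
  warn "Skipping #{url}: #{e.class}: #{e.message}"
  nil
end
```

    Inside the loop, the fetch could then be written as try_fetch(url) { |u| Nokogiri::HTML(open(u)) }, followed by a next if the result is nil.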

    You could also inspect the HTTP response headers or the status code, or look at the exception message, if you want more control and need to do something different for a 401 than for a 404.
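
    One way to get at the status: OpenURI::HTTPError exposes the failed response through its io method, and io.status is an array such as ["404", "Not Found"]. The following is a minimal sketch; the helper name classify_http_error and the categories it returns are invented for illustration.

```ruby
require 'open-uri'

# Hypothetical helper: maps an OpenURI::HTTPError to a symbol that the
# caller can branch on. The grouping of codes is only an example.
def classify_http_error(error)
  code = error.io.status.first # status is [code, message], e.g. ["404", "Not Found"]
  case code
  when '401', '403' then :unauthorized
  when '404', '410' then :missing
  when /\A5\d\d\z/  then :server_error
  else :other
  end
end
```

    In the rescue clause above, a call like classify_http_error(e) could then decide whether to skip the URL, retry later, or stop.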

    "I can access this from a web browser, I just don't get it at all."

    Well, that could be something happening on the server side. Perhaps the server doesn't like the User-Agent string you're sending? The OpenURI documentation shows how to change that header:

    "Additional header fields can be specified by an optional hash argument."

    open("http://www.ruby-lang.org/en/",
      "User-Agent" => "Ruby/#{RUBY_VERSION}",
      "From" => "foo@bar.invalid",
      "Referer" => "http://www.ruby-lang.org/") {|f|
      # ...
    }
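
    Applied to the scraping case, that might look like the sketch below. It assumes Ruby 2.5 or newer, where URI.open is available (on older versions, plain open takes the same arguments). The helper names and the User-Agent value are placeholders, and there is no guarantee that this particular server will respond differently to them.

```ruby
require 'open-uri'

# Hypothetical default: any browser-like User-Agent string could be tried.
def request_headers(user_agent = 'Mozilla/5.0 (compatible; ExampleScraper/1.0)')
  { 'User-Agent' => user_agent }
end

# Fetches the raw response body, sending the custom headers along.
# Still raises OpenURI::HTTPError if the server answers with an error status.
def fetch_body(url, headers = request_headers)
  URI.open(url, headers) { |io| io.read }
end
```

    The returned string can be passed straight to Nokogiri::HTML, inside the same begin/rescue block as before.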