ruby-on-rails ruby nokogiri open-uri

Why does OpenURI return a 404 when the parsed URL works fine in a browser?


I'm trying to screen-scrape a URL containing special characters like the Danish character 'ø'.

The URL is:

url = "http://www.zara.com/dk/da/dame/tilbehør/tilbehør/stribet-hue-c271008p2195502.html"

In order to have OpenURI recognize it as a valid URL, I do:

url = Addressable::URI.parse(url).normalize.to_s

and parse it with:

doc = Nokogiri::HTML(open(url))

which returns:

OpenURI::HTTPError: 404 Not Found

I have no clue why OpenURI returns a 404, because the normalized URL works fine in a browser.

Why is this the case, and what do I have to do to fix it?


Solution

  • I found out that the problem was with the server behind the URL I was trying to parse: it rejected the default User-Agent that OpenURI sends.

    The OpenURI documentation says that additional header fields can be specified by an optional hash argument:

    open("http://www.ruby-lang.org/en/",
      "User-Agent" => "Ruby/#{RUBY_VERSION}",
      "From" => "foo@bar.invalid",
      "Referer" => "http://www.ruby-lang.org/") {|f|
      # ...
    }
    

    I just used a different User-Agent and everything worked fine.
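
A minimal sketch of that fix, assuming Ruby 2.5 or later (where `URI.open` is available in place of the bare `open` used in the question). The User-Agent string and the helper names `request_headers` and `fetch_html` are illustrative assumptions, not part of OpenURI; substitute whatever value the target server accepts.

```ruby
require "open-uri"

# Hypothetical browser-like User-Agent; the exact string is an assumption.
BROWSER_USER_AGENT = "Mozilla/5.0 (compatible; ExampleScraper/1.0)"

# Build the header hash that OpenURI accepts as its optional argument.
def request_headers(user_agent = BROWSER_USER_AGENT)
  { "User-Agent" => user_agent }
end

# Fetch the response body as a string, sending the custom headers.
# Raises OpenURI::HTTPError if the server still answers with an error status.
def fetch_html(url, headers = request_headers)
  URI.open(url, headers) { |f| f.read }
end
```

The returned string can then be handed to Nokogiri as before, for example `doc = Nokogiri::HTML(fetch_html(url))`, where `url` is the normalized URL from the question.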