Receiving NoMethodError when attempting to retrieve HTML source for webpage using open


As part of my webpage I need to use Open-URI in order to grab the source for a webpage. For some reason whenever I try to grab the source for the webpage found at http://learning.blogs.nytimes.com/2010/08/23/teaching-with-infographics-places-to-start/?_php=true&_type=blogs&_php=true&_type=blogs&_r=1 I get a NoMethodError stating "undefined method `+' for nil:NilClass". I'm not sure what's causing the issue. The webpage seems to load fine when accessed from my web browser. Here's a snippet you can run in the console to recreate this error.

require 'open-uri'
open("http://learning.blogs.nytimes.com/2010/08/23/teaching-with-infographics-places-to-start/?_php=true&_type=blogs&_php=true&_type=blogs&_r=1")

Thanks in advance!

EDIT: Here's the full error message in case anybody is interested.

NoMethodError: undefined method `+' for nil:NilClass
    from /usr/lib64/ruby/2.1.0/net/http.rb:1530:in `addr_port'
    from /usr/lib64/ruby/2.1.0/net/http.rb:1463:in `begin_transport'
    from /usr/lib64/ruby/2.1.0/net/http.rb:1405:in `transport_request'
    from /usr/lib64/ruby/2.1.0/net/http.rb:1379:in `request'
    from /usr/lib64/ruby/2.1.0/open-uri.rb:343:in `block in open_http'
    from /usr/lib64/ruby/2.1.0/net/http.rb:854:in `start'
    from /usr/lib64/ruby/2.1.0/open-uri.rb:336:in `open_http'
    from /usr/lib64/ruby/2.1.0/open-uri.rb:751:in `buffer_open'
    from /usr/lib64/ruby/2.1.0/open-uri.rb:214:in `block in open_loop'
    from /usr/lib64/ruby/2.1.0/open-uri.rb:211:in `catch'
    from /usr/lib64/ruby/2.1.0/open-uri.rb:211:in `open_loop'
    from /usr/lib64/ruby/2.1.0/open-uri.rb:152:in `open_uri'
    from /usr/lib64/ruby/2.1.0/open-uri.rb:731:in `open'
    from /usr/lib64/ruby/2.1.0/open-uri.rb:34:in `open'
    from (irb):2
    from /usr/bin/irb:11:in `<main>'

I've started looking through the source code for the files listed above to no avail so far.


Solution

  • This isn't a problem with your code; rather, it's a case of the New York Times's access controls messing up your day. The traceback you're getting comes entirely from the standard library (see how all the paths begin with /usr/lib64?), which is a strong indicator that the problem isn't in your code. Sometimes you'll get errors like this when you're using a library incorrectly, but you've already determined that your code works for other URLs. So how can we figure out what's going on?

    Ruby's open-uri module is a wrapper around the net/http module. We can find out more about what's going on by using the net/http module directly:

    require 'net/http'
    uri = URI("http://learning.blogs.nytimes.com/2010/08/23/teaching-with-infographics-places-to-start/?_php=true&_type=blogs&_php=true&_type=blogs&_r=1")
    response = Net::HTTP.get_response(uri)
    p response # #<Net::HTTPSeeOther 303 See Other readbody=true>
    p response['location'] # "http://www.nytimes.com/glogin?URI=http://learning.blogs.nytimes.com/2010/08/23/teaching-with-infographics-places-to-start/&OQ=_phpQ3DtrueQ26_typeQ3DblogsQ26_phpQ3DtrueQ26_typeQ3DblogsQ26_phpQ3DtrueQ26_typeQ3DblogsQ26_rQ3D2Q26&OP=e8954d71Q2FgyQ2BvgdMvgQ27Q27Q27gEQ2BQ2BQ51JQ23yiuPQ2BUQ2B"
    

    When retrieved from Ruby, that URL responds with 303 See Other and attempts to redirect us to a login page. This isn't directly related to the paywall, but it's a similar theme: the New York Times is protective of its content, and would rather it weren't read by scripts instead of people in browsers.
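To make that check explicit in code, here is a minimal sketch that classifies a response by its class instead of reading the inspect output by eye. The helper name describe_response is hypothetical, and the response object is built by hand (with a shortened Location value) so the snippet runs without touching the network; in real use you would pass in the result of Net::HTTP.get_response.

```ruby
require 'net/http'

# Hypothetical helper (not part of net/http): summarizes what kind of
# response came back, so a redirect is not mistaken for real page content.
def describe_response(response)
  case response
  when Net::HTTPSuccess
    "success (#{response.code})"
  when Net::HTTPRedirection
    "redirect (#{response.code}) to #{response['location']}"
  else
    "other (#{response.code})"
  end
end

# Hand-built stand-in for the 303 described above; the Location value is
# shortened for illustration and is not the full URL the site returned.
see_other = Net::HTTPSeeOther.new('1.1', '303', 'See Other')
see_other['location'] = 'http://www.nytimes.com/glogin'

puts describe_response(see_other)
```

Matching on Net::HTTPRedirection rather than on the literal code 303 means the same branch also covers 301, 302, and 307 responses.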

    Sometimes, you can fool websites into giving you the content by spoofing the user agent, but it seems the NYT are wise to that. I couldn't get the site to send me anything other than a 303 response, but if you're persistent you could probably find a way.
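For reference, this is roughly what spoofing the user agent looks like with net/http. It is only a sketch: the helper name build_request, the example.com URL, and the User-Agent string are all placeholders, and as noted above this did not get past the 303 for the NYT page. The actual send is left commented out because it needs network access.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds a GET request carrying a browser-like
# User-Agent header instead of Ruby's default one.
def build_request(url, user_agent)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request['User-Agent'] = user_agent
  [uri, request]
end

# Placeholder URL and User-Agent string, chosen only for illustration.
uri, request = build_request('http://example.com/', 'Mozilla/5.0 (X11; Linux x86_64)')

# Sending the request would look like this (requires network access):
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```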

    But if this web page isn't crucial to your app and you'd just like to stop it crashing, I'd write something like this:

    require 'net/http'
    uri = URI("http://learning.blogs.nytimes.com/2010/08/23/teaching-with-infographics-places-to-start/?_php=true&_type=blogs&_php=true&_type=blogs&_r=1")
    response = Net::HTTP.get_response(uri)

    # A redirect such as the 303 above is not a success, even if it carries a
    # body, so check the response class rather than whether the body is empty.
    if response.is_a?(Net::HTTPSuccess)
      # Process the contents of the webpage here, accessed via response.body
    else
      # Show the user an error message
    end