ruby-on-rails html-parsing nokogiri stack-overflow

Rails + Nokogiri + Heroku - response 503 for URLs from StackOverflow


I'm writing a just-for-fun app for my own use. In this app I submit URLs through a classic POST form and extract some information from each page. For example, this line extracts the title of the page:

self.name = Nokogiri::HTML(open(self.url)).css('title').to_s.sub('<title>','').to_s.sub('</title>','')

I'm using Nokogiri (v1.5.4) to parse data from the source page. I don't know if I'm missing something here, but the application behaves strangely.

When I run the app on localhost in my development environment, everything works properly. But after pushing to Heroku, some problems occurred. For example, URLs from StackOverflow always produce this error:

OpenURI::HTTPError (503 Service Unavailable):
app/models/url.rb:67:in `set_name'
app/controllers/urls_controller.rb:48:in `block in create'
app/controllers/urls_controller.rb:46:in `create'

I don't understand why it happens only on Heroku. On my local machine it works perfectly with the same URL. Maybe I'm missing something about Heroku, but other URLs return a normal 200 status and work fine. It's only URLs from StackOverflow.


Solution

  • Don't use:

    .to_s.sub('<title>','').to_s.sub('</title>','')
    

    Instead use:

    .text
    

    For instance:

    html = '<head><title>foo</title></head>'
    Nokogiri::HTML(html).css('title').text
    

    In IRB:

    irb(main):055:0> html = '<head><title>foo</title></head>'
    "<head><title>foo</title></head>"
    irb(main):056:0> Nokogiri::HTML(html).css('title').text
    "foo"
    

    As for why StackOverflow URLs fail on Heroku with a 503: it might be a routing or hosting issue.

    Rather than scraping pages, you might want to consider "Where is Stack Overflow's public data dump?" and "Stack Overflow Creative Commons Data Dump".
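Whatever the cause of the 503 turns out to be, the model probably shouldn't let `OpenURI::HTTPError` bubble up into the controller. Below is a minimal sketch of one way to guard the fetch, not a confirmed fix for the Heroku behavior. The names `fetch_html` and `opener` are hypothetical and only there for illustration. On current Ruby versions `URI.open` replaces the bare `open(url)` used in the question.

```ruby
require 'open-uri'

# Hypothetical helper: return the page body as a String, or nil when the
# remote server answers with an HTTP error such as 503.
# `opener` is an assumption added for illustration; it lets the method be
# exercised without real network access.
def fetch_html(url, opener: ->(u) { URI.open(u).read })
  opener.call(url)
rescue OpenURI::HTTPError => e
  # e.io.status is an Array such as ["503", "Service Unavailable"]
  code, message = e.io.status
  warn "Could not fetch #{url}: #{code} #{message}"
  nil
end
```

The returned HTML can then be handed to `Nokogiri::HTML(html).css('title').text` as shown above, with the `nil` case skipped or reported to the user instead of raising.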