ruby, ruby-on-rails-3, screen-scraping, nokogiri

Why do I get "wrong status line" errors from Nokogiri?


My Ruby/Nokogiri script is:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

f = File.new("enterret" + ".txt", 'w')

1.upto(100) do |page|
  urltext = "http://xxxxxxx.com/" + "page/"
  urltext << page.to_s + "/"
  doc = Nokogiri::HTML(open(urltext))
  doc.css(".photoPost").each do |post|
    quote = post.css("h1 + p").text
    author = post.css("h1 + p + p").text
    f.puts "#{quote}" + "#{author}"
    f.puts "--------------------------------------------------------"
  end
end

When I run this script, I get the following error:

http.rb:2030:in `read_status_line': wrong status line: "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"" (Net::HTTPBadResponse)

However, my script writes to the file correctly; it's just that this error keeps coming up. What does the error mean?


Solution

  • Without knowing what site you are accessing it is hard to say for sure, but I suspect that the problem isn't in Nokogiri.

    The error is being reported by http.rb, which is most likely complaining about the HTTP headers being returned. http.rb is concerned with the handshake with the HTTP server and would whine about missing or malformed headers, but it wouldn't care about the payload.

    Nokogiri, on the other hand, is concerned with the payload, i.e., the HTML. The DOCTYPE is supposed to be part of the HTML payload, so I suspect their server is sending the HTML DOCTYPE at a point where http.rb expects an HTTP status line (something like "HTTP/1.1 200 OK") followed by headers such as "Content-Type: text/html".
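
    To make the difference concrete, here is a minimal sketch of the kind of check an HTTP client applies to the first line of a response. The constant STATUS_LINE and the helper status_code are hypothetical names used for illustration; this is a simplified stand-in, not the actual http.rb source.

    ```ruby
    # Hypothetical, simplified status-line check (not the real net/http code).
    # A valid status line starts with "HTTP/<version> <3-digit code>".
    STATUS_LINE = /\AHTTP\/\d+\.\d+\s+(\d{3})\b/

    # Returns the numeric status code, or nil if the line is not a status line.
    def status_code(first_line)
      match = STATUS_LINE.match(first_line)
      match && match[1].to_i
    end

    status_code("HTTP/1.1 200 OK")
    # => 200
    status_code('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"')
    # => nil
    ```

    A real client raises an error in the second case instead of returning nil, which would match the "wrong status line" message in the question.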

    In Ruby 1.8.7's http.rb you'll see the following lines around line 2030:

    def response_class(code)
      CODE_TO_OBJ[code] or
      CODE_CLASS_TO_OBJ[code[0,1]] or
      HTTPUnknownResponse
    end
    

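    That is the neighborhood the message comes from. The backtrace itself names read_status_line, the method that reads the first line of the response and raises Net::HTTPBadResponse when that line is not a valid HTTP status line.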
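
    If the server only misbehaves occasionally, one possible workaround is to rescue the exception and retry or skip the page. This is a rough sketch, assuming the failure is intermittent; with_http_retry is a hypothetical helper, not part of open-uri or Nokogiri.

    ```ruby
    require 'net/http'

    # Hypothetical helper: run the block, retrying when net/http rejects the
    # response's status line. Returns nil if every attempt fails.
    def with_http_retry(attempts = 3)
      tries = 0
      begin
        tries += 1
        yield
      rescue Net::HTTPBadResponse => e
        retry if tries < attempts
        warn "giving up after #{tries} attempts: #{e.message}"
        nil
      end
    end

    # Possible use inside the loop from the question:
    #   html = with_http_retry { open(urltext).read }
    #   next if html.nil?
    #   doc = Nokogiri::HTML(html)
    ```

    Retrying only hides the symptom; if every request fails the same way, the server's response itself is what needs a closer look.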