Tags: ruby, nokogiri, open-uri

Why doesn't Nokogiri load the full page?


I'm using Nokogiri to open Wikipedia pages about various countries, and then extracting the names of those countries in other languages from the interwiki links (links to foreign-language Wikipedias). However, when I try to open the page for France, Nokogiri does not download the full page. Maybe it's too large; in any case, the parsed document doesn't contain the interwiki links that I need. How can I force it to download everything?

Here's my code:

require 'nokogiri'
require 'open-uri'

url = "http://en.wikipedia.org/wiki/" + country_name
page = nil
begin
  page = Nokogiri::HTML(open(url))
rescue OpenURI::HTTPError => e
  puts "No article found for " + country_name
end

language_part = page.css('div#p-lang')

Test:

with country_name = "France"
=> []

with country_name = "Thailand"
=> really long array that I don't want to quote here,
   but containing all the right data

Maybe this issue goes beyond Nokogiri and into OpenURI; either way, I need to find a solution.


Solution

  • Nokogiri does not retrieve the page itself; it asks OpenURI to do that, and then does an internal read on the StringIO object that open-uri returns.

    require 'open-uri'
    require 'zlib'
    
    stream = open('http://en.wikipedia.org/wiki/France')
    if stream.content_encoding.empty?
      # Uncompressed response: read it straight through.
      body = stream.read
    else
      # Wikipedia sent a gzip-encoded body, so decompress it first.
      body = Zlib::GzipReader.new(stream).read
    end
    
    p body
    

    Here's what you can key off of:

    >> require 'open-uri' #=> true
    >> open('http://en.wikipedia.org/wiki/France').content_encoding #=> ["gzip"]
    >> open('http://en.wikipedia.org/wiki/Thailand').content_encoding #=> []
    

    In this case, if content_encoding is [] the body came back uncompressed and the code reads it directly; if it's ["gzip"] the body gets decompressed first.
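
    If you expect to fetch pages for several countries, that branch is easy to wrap in a small helper. A minimal sketch, where fetch_html_body is just an illustrative name rather than anything OpenURI provides:

    require 'open-uri'
    require 'zlib'
    
    # Hypothetical helper: fetch a URL and return the body as a plain string,
    # decompressing it when the server sent a gzip-encoded response.
    def fetch_html_body(url)
      stream = open(url)
      if stream.content_encoding.empty?
        stream.read
      else
        Zlib::GzipReader.new(stream).read
      end
    end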

    Doing all the stuff above and tossing it to:

    require 'nokogiri'
    page = Nokogiri::HTML(body)
    language_part = page.css('div#p-lang')
    

    should get you back on track.

    Do this after all the above to confirm visually you're getting something usable:

    p language_part.text.gsub("\t", '')
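
    Putting the pieces together with the question's extraction, using the fetch_html_body helper sketched above, a sketch along these lines should work. The div#p-lang selector comes from the original code; drilling into li a to get the language names is an assumption about Wikipedia's sidebar markup:

    require 'nokogiri'
    
    country_name = 'France'
    body = fetch_html_body('http://en.wikipedia.org/wiki/' + country_name)
    page = Nokogiri::HTML(body)
    
    # Interwiki links live inside div#p-lang; collect the text of each link.
    language_names = page.css('div#p-lang li a').map(&:text)
    p language_names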
    

    See Casper's answer and comments about why you saw two different results. Originally it looked like Open-URI was inconsistent in its processing of the returned data, but based on what Casper said, and what I saw using curl, Wikipedia isn't honoring the "Accept-Encoding" header for large documents and returns gzip. That is fairly safe with today's browsers but clients like Open-URI that don't automatically sense the encoding will have problems. That's what the code above should help fix.
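
    If you want to check that behavior from Ruby instead of curl, one way is to make the request yourself and inspect the Content-Encoding response header. A rough sketch along these lines; whether Wikipedia still responds this way is not guaranteed:

    require 'net/http'
    require 'uri'
    
    uri = URI('http://en.wikipedia.org/wiki/France')
    Net::HTTP.start(uri.host, uri.port) do |http|
      # Explicitly ask for an uncompressed body; the interesting question is
      # whether the server honors that or sends gzip anyway.
      response = http.get(uri.request_uri, 'Accept-Encoding' => 'identity')
      puts response['Content-Encoding'].inspect
    end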