I'm using Nokogiri to open Wikipedia pages about various countries, and then extracting the names of those countries in other languages from the interwiki links (links to foreign-language wikipedias). However, when I try to open the page for France, Nokogiri does not download the full page. Maybe it's too large; in any case, it doesn't contain the interwiki links that I need. How can I force it to download the whole page?
Here's my code:
require 'nokogiri'
require 'open-uri'

url = "http://en.wikipedia.org/wiki/" + country_name
page = nil
begin
  page = Nokogiri::HTML(open(url))
rescue OpenURI::HTTPError => e
  puts "No article found for " + country_name
end
language_part = page.css('div#p-lang')
Test:
with country_name = "France"
=> []
with country_name = "Thailand"
=> really long array that I don't want to quote here,
but containing all the right data
Maybe this issue goes beyond Nokogiri and into OpenURI; either way, I need to find a solution.
Nokogiri does not retrieve the page itself; it asks OpenURI to fetch it, then does an internal read on the StringIO object that OpenURI returns.
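Here's a minimal sketch of that handoff, just to illustrate; both calls below parse exactly the bytes OpenURI hands back, compressed or not:
require 'nokogiri'
require 'open-uri'

# Nokogiri never touches the network. open() returns an IO-ish object
# (a StringIO or Tempfile); Nokogiri::HTML just calls #read on it and
# parses whatever bytes come back, gzip-compressed or not.
doc_from_io     = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/France'))
doc_from_string = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/France').read)
So the decompression has to happen before the body ever reaches Nokogiri: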
require 'open-uri'
require 'zlib'

stream = open('http://en.wikipedia.org/wiki/France')
if stream.content_encoding.empty?
  # No Content-Encoding header: the body is plain, read it as-is.
  body = stream.read
else
  # Content-Encoding was set (gzip here), so decompress it first.
  body = Zlib::GzipReader.new(stream).read
end
p body
Here's what you can key off of:
>> require 'open-uri' #=> true
>> open('http://en.wikipedia.org/wiki/France').content_encoding #=> ["gzip"]
>> open('http://en.wikipedia.org/wiki/Thailand').content_encoding #=> []
In this case, if content_encoding is [] (i.e., a plain "text/html" response), it just reads the stream. If it's ["gzip"], it decompresses it first.
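If you want that check wrapped up in one place, something like this sketch works (the helper name is mine, and it assumes gzip is the only encoding you'll ever see here, which matched what I saw but isn't guaranteed):
require 'open-uri'
require 'zlib'

# Hypothetical helper: returns the decompressed body for a URL,
# assuming gzip is the only Content-Encoding we need to handle.
def fetch_body(url)
  stream = open(url)
  if stream.content_encoding.include?('gzip')
    Zlib::GzipReader.new(stream).read
  else
    stream.read
  end
end

body = fetch_body('http://en.wikipedia.org/wiki/France')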
Doing all the stuff above and tossing it to:
require 'nokogiri'
page = Nokogiri::HTML(body)
language_part = page.css('div#p-lang')
should get you back on track.
Do this after all of the above to confirm visually that you're getting something usable:
p language_part.text.gsub("\t", '')
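If you want the individual language names rather than one blob of text, something along these lines should work. It assumes the interlanguage links are plain <li><a> entries inside div#p-lang, which is what the markup looked like when I checked; verify against the page you actually get back:
# Hypothetical extraction of the per-language link text: assumes each
# interlanguage link is an <a> inside an <li> under div#p-lang.
names = page.css('div#p-lang li a').map { |a| a.text.strip }
p names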
See Casper's answer and the comments about why you saw two different results. Originally it looked like Open-URI was inconsistent in its processing of the returned data, but based on what Casper said, and what I saw using curl, Wikipedia isn't honoring the "Accept-Encoding" header for large documents and returns gzipped content anyway. That's fairly safe with today's browsers, but clients like Open-URI that don't automatically sense the encoding will have problems. That's what the code above should help fix.
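If you want to see those response headers from Ruby instead of curl, this is a rough equivalent of the curl check (not verified against Wikipedia's current behavior): ask for an uncompressed body and look at what Content-Encoding actually comes back.
require 'net/http'
require 'uri'

# Request an identity (uncompressed) body and print the Content-Encoding
# the server actually sends back.
uri = URI.parse('http://en.wikipedia.org/wiki/France')
Net::HTTP.start(uri.host, uri.port) do |http|
  response = http.get(uri.request_uri, 'Accept-Encoding' => 'identity')
  p response['Content-Encoding']
end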