ruby open-uri

open-uri's open is slower than web browsers


I am currently struggling to scrape the following website, http://mangafox.me, and am having issues with open.

The following code works just fine for most sites, but I am having issues with mangafox:

require 'open-uri'
html = open('http://mangafox.me', 'User-Agent' => "Ruby/#{RUBY_VERSION}")

I get very fast responses from https://google.com and most of the sites I tested, but I keep getting Net::OpenTimeout exceptions on http://mangafox.me and only get the HTML page occasionally (after many tries).

Browsers, however, work just fine and display the website quickly, even after emptying the cache.

I am currently using Ruby 2.4.0 and have tried the code on both Arch Linux (Manjaro) and Debian (Ubuntu on Windows 10), in 2 different locations (to ensure that my IP is not the issue).
I also put a sleep (0.5 seconds) between each open to avoid being blocked for making too many requests.

I also had the same issue with the curb gem:

require 'curb'
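link = 'http://mangafox.me'
# Curl.get returns a Curl::Easy object; body_str holds the response body
html = Curl.get(link).body_str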

Since the browsers (I tried Firefox and Chromium) work perfectly, should I try to imitate them (by emulating one, for example)? Or is there an easier solution (a gem, another way to use open, ...)?


Solution

  • First, you haven't made it clear how you determined that your browser is faster than Ruby's open-uri.

    Regardless, there are a number of possibilities:

    1. Your browser is caching the page locally (your recent comment implies this isn't the case, though a freshly installed Chromium could conceivably be using a shared cache that open-uri doesn't know about).
    2. Conceivably there is an upstream cache that is caching based on user-agent, though I don't know of such a thing.
    3. The website you are accessing supports a protocol that open-uri does not, such as HTTP/2 or SPDY.
    4. The website is serving different content/protocols based on user-agent.
    5. You are being traffic limited (possibly because of your user-agent or your location; you don't mention whether Ruby and the browser are running on the same machine).

    One of the first tests (after you are clear about how you are determining "speed" versus a browser) would be to try using the same user-agent as your browser, and possibly also having the browser use the same user-agent you are using in Ruby.
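As a rough sketch of that test, the snippet below times a single open-uri fetch under a given user-agent, so the browser's string and the Ruby string can be compared side by side. The `BROWSER_UA` value is only a placeholder (copy the real one from your browser's developer tools), and the helper names `request_options` and `timed_fetch` as well as the 10-second timeouts are assumptions made for illustration, not part of the original question.

```ruby
require 'open-uri'
require 'benchmark'

# Placeholder browser User-Agent (assumption): replace with your browser's real string.
BROWSER_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
# The User-Agent used in the question.
RUBY_UA = "Ruby/#{RUBY_VERSION}"

# Build the options passed to open-uri. String keys are sent as HTTP headers;
# symbol keys are open-uri options (timeouts are in seconds).
def request_options(user_agent, open_timeout: 10, read_timeout: 10)
  {
    'User-Agent' => user_agent,
    open_timeout: open_timeout,
    read_timeout: read_timeout
  }
end

# Fetch url once and return [elapsed_seconds, outcome], where outcome is either
# the body size or the class and message of the exception that was raised.
def timed_fetch(url, user_agent)
  outcome = nil
  seconds = Benchmark.realtime do
    begin
      body = URI.parse(url).open(request_options(user_agent)).read
      outcome = "#{body.bytesize} bytes"
    rescue StandardError => e
      outcome = "#{e.class}: #{e.message}"
    end
  end
  [seconds, outcome]
end

# Example usage (performs real network requests, so it is left commented out):
#   [BROWSER_UA, RUBY_UA].each do |ua|
#     seconds, outcome = timed_fetch('http://mangafox.me', ua)
#     puts format('%-12s %.2fs  %s', ua[0, 12], seconds, outcome)
#   end
```

If the browser string succeeds consistently while the Ruby string times out, that would point toward possibilities 4 or 5 above. If both behave the same, the user-agent is probably not the cause.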