I am currently struggling to scrape the following website: http://mangafox.me, and am having issues with open.
The following code works just fine for most sites, but I am having issues with mangafox:
require 'open-uri'
html = open('http://mangafox.me', 'User-Agent' => "Ruby/#{RUBY_VERSION}")
I get a very fast response from https://google.com and most of the other sites I tested, but I keep getting OpenTimeout exceptions on http://mangafox.me and only get the HTML page sometimes (after many tries).
Browsers, however, work just fine and have no issue displaying the website quickly (even with an emptied cache).
I am currently using Ruby 2.4.0 and have tried the code on both Arch Linux (Manjaro) and Debian (Ubuntu on Windows 10), in two different locations (to make sure my IP is not the issue).
I also put a 0.5-second sleep between each open call to avoid being blocked for making too many requests.
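For reference, my retry loop looks roughly like this (simplified: the helper name, retry count, and the injectable fetcher lambda are my own placeholders, not part of any library):

```ruby
require 'open-uri'
require 'net/http' # defines Net::OpenTimeout

# Simplified sketch of my retry loop. The fetcher lambda defaults to
# open-uri (URI.open on newer Rubies; plain open works the same way on
# Ruby 2.4) and can be swapped out, e.g. for testing without a network.
def fetch_with_retry(url, max_tries: 5, delay: 0.5, fetcher: nil)
  fetcher ||= ->(u) { URI.open(u, 'User-Agent' => "Ruby/#{RUBY_VERSION}").read }
  tries = 0
  begin
    tries += 1
    fetcher.call(url)
  rescue Net::OpenTimeout
    raise if tries >= max_tries # give up after max_tries attempts
    sleep delay                 # wait before hitting the site again
    retry
  end
end
```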
I also had the same issue with the curb gem:
require 'curb'
html = Curl.get('http://mangafox.me').body_str
Since browsers (I tried Firefox and Chromium) work perfectly, should I try to imitate them (by emulating one, for example)? Or is there an easier solution (a gem / another way to use open / ...)?
First, you aren't making it clear how you've determined that your browser is faster than Ruby's open-uri.
Regardless, there are a number of possibilities:
One of the first tests (after you are clear about how you are measuring "speed" versus a browser) would be to use the same user-agent as your browser, and possibly also to have the browser use the same user-agent you are using in Ruby.
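A minimal sketch of that test, assuming open-uri. The user-agent string below is just an example Firefox one (copy yours from your browser's developer tools), and the opener: hook is a hypothetical seam I've added so the header-building can be exercised without a network call:

```ruby
require 'open-uri'

# Example desktop-browser user-agent string; replace with the exact
# string your own browser sends.
BROWSER_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0'.freeze

# Fetch a page while presenting browser-like request headers.
# The opener lambda delegates to open-uri by default, but can be
# swapped out (e.g. to inspect the headers in a test).
def fetch_as_browser(url, ua: BROWSER_UA, opener: nil)
  opener ||= ->(u, headers) { URI.open(u, headers).read }
  headers = {
    'User-Agent'      => ua,
    'Accept'          => 'text/html,application/xhtml+xml',
    'Accept-Language' => 'en-US,en;q=0.5'
  }
  opener.call(url, headers)
end
```

If the site responds quickly with the browser's user-agent but still times out with Ruby's default one, the user-agent is the differentiator; if it times out either way, look elsewhere (connection handling, IP reputation, etc.).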