I'm trying to use open-uri to scrape the data from:
https://www.zomato.com/grande-lisboa/fu-hao-massamá
But the website automatically redirects to:
https://www.zomato.com/pt/grande-lisboa/fu-hao-massamá
I don't want the Portuguese version. I want the English one. How do I tell Ruby to request that version instead of being redirected?
This is called content negotiation: the web server decides which version to serve based on your request. When no language is specified, pt (Portuguese) seems to be the default, at least from my location:
$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1
HTTP/1.1 301 Moved Permanently
Set-Cookie: zl=pt; ...
Location: https://www.zomato.com/pt/grande-lisboa/fu-hao-massam%C3%A1
You can request another language by sending an Accept-Language header. Here's the response for Accept-Language: es (Spanish):
$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1 -H "Accept-Language: es"
HTTP/1.1 301 Moved Permanently
Set-Cookie: zl=es_cl; ...
Location: https://www.zomato.com/es/grande-lisboa/fu-hao-massam%C3%A1
And here's the response for Accept-Language: en (English):
$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1 -H "Accept-Language: en"
HTTP/1.1 200 OK
Set-Cookie: zl=en; ...
No redirect this time: the server answers 200 OK, so this seems to be the resource you've been looking for.
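If you'd rather run the header checks above from Ruby than from curl, here is a minimal sketch using Net::HTTP from the standard library. The helper names build_head_request and probe_language are made up for illustration, and what the server returns will depend on your location and on how the site behaves today.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a HEAD request (like `curl -I`) that carries
# an Accept-Language header. Nothing is sent yet.
def build_head_request(url, language)
  request = Net::HTTP::Head.new(URI(url))
  request['Accept-Language'] = language
  request
end

# Hypothetical helper: send the request and return the status code (a String)
# together with the Location header, which is nil when there is no redirect.
# Net::HTTP does not follow redirects by itself, so a 301 is visible here.
def probe_language(url, language)
  uri = URI(url)
  request = build_head_request(url, language)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  [response.code, response['Location']]
end
```

Calling probe_language(url, 'en') and probe_language(url, 'es') should let you compare the status codes and redirect targets in the same way as the curl calls.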
To fetch the English page from Ruby, pass the same header to open-uri:
require 'nokogiri'
require 'open-uri'
url = 'https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1'
headers = {'Accept-Language' => 'en'}
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open(url, headers))
doc.at('html')[:lang]
#=> "en"
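If you also want to notice when the server falls back to another language rather than following the redirect silently, open-uri accepts a :redirect option. The following is only a sketch, assuming Ruby 2.5 or later for URI.open; fetch_in_language is a hypothetical helper name, not part of open-uri.

```ruby
require 'open-uri'

# Hypothetical helper: fetch url while asking for a single language and
# refuse to follow redirects. A redirect (for example to /pt/...) then raises
# instead of quietly returning a different language version.
def fetch_in_language(url, language)
  options = { 'Accept-Language' => language, redirect: false }
  # String keys are sent as request headers, symbol keys are open-uri options.
  URI.open(url, options, &:read)
rescue OpenURI::HTTPRedirect => e
  raise "No direct #{language} response; server redirected to #{e.uri}"
end
```

The returned string can be handed to Nokogiri::HTML just like the IO object in the snippet above.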