ruby, web-scraping, nokogiri, net-http, open-uri

net/http automatically redirects webpage to another language


I'm trying to use open-uri to scrape the data from:

https://www.zomato.com/grande-lisboa/fu-hao-massamá

But the website automatically redirects to:

https://www.zomato.com/pt/grande-lisboa/fu-hao-massamá

I don't want the Portuguese version. I want the English one. How do I tell Ruby to stop doing that?


Solution

  • This is called content negotiation: the web server picks a language version based on your request and redirects to it. When the request states no language preference, pt (Portuguese) seems to be the default, at least from my location:

    $ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1
    HTTP/1.1 301 Moved Permanently
    Set-Cookie: zl=pt; ...
    Location: https://www.zomato.com/pt/grande-lisboa/fu-hao-massam%C3%A1
    

    You can request another language by sending an Accept-Language header. Here's the response for Accept-Language: es (Spanish):

    $ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1 -H "Accept-Language: es"
    HTTP/1.1 301 Moved Permanently
    Set-Cookie: zl=es_cl; ...
    Location: https://www.zomato.com/es/grande-lisboa/fu-hao-massam%C3%A1
    

    And here's the response for Accept-Language: en (English):

    $ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1 -H "Accept-Language: en"
    HTTP/1.1 200 OK
    Set-Cookie: zl=en; ...
    

    The server answers 200 OK without redirecting, so this seems to be the English page you've been looking for.

    In Ruby you'd use:

    require 'nokogiri'
    require 'open-uri'
    
    url = 'https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1'
    headers = {'Accept-Language' => 'en'}
    
    # URI.open: open-uri no longer patches Kernel#open as of Ruby 3.0
    doc = Nokogiri::HTML(URI.open(url, headers))
    doc.at('html')[:lang]
    #=> "en"
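
The curl checks above can also be reproduced from Ruby with net/http, which may help when debugging which language a server picks. This is only a minimal sketch: the helper names `language_request` and `probe_language` are made up for illustration, and it assumes the server still reacts to Accept-Language as shown in the curl output. Unlike open-uri, Net::HTTP does not follow redirects by itself, so a 301 and its Location header can be inspected directly.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a HEAD request (like `curl -I`) that
# asks for the given language via the Accept-Language header.
def language_request(url, lang)
  request = Net::HTTP::Head.new(URI.parse(url))
  request['Accept-Language'] = lang
  request
end

# Hypothetical helper: send the request and return the status code
# together with the Location header (nil when there is no redirect).
# Needs network access, so it is defined here but not called.
def probe_language(url, lang)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(language_request(url, lang))
  end
  [response.code, response['Location']]
end

url = 'https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1'
request = language_request(url, 'en')
puts request['Accept-Language'] # prints "en"
```

Calling `probe_language(url, 'en')` and `probe_language(url, 'es')` should then mirror the two curl calls; the actual status codes depend on how the site behaves at the time you run it.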