
Python - Get redirected url of links from Google Alerts feeds


If you create a Google Alert as an RSS feed (rather than having it sent automatically to your e-mail address), it contains links like this one: https://www.google.com/url?rct=j&sa=t&url=http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/&ct=ga&cd=CAIyGjkyZjE1NGUzMGIwZjRkNGQ6Y29tOmVuOlVT&usg=AFQjCNHrCLmbml7baTXaqySagcuKHp-KHA.

This link is obviously a redirect (just try it and you'll end up here: http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/), but I cannot get the final URL with Python (other than by stripping the beginning of the URL, which is quite ugly).

So far I've tried the urllib2, httplib2 and requests packages:

  • urllib2.urlopen and geturl() from the return value
  • httplib2 request with follow_all_redirects=True and 'content-location' from the return value
  • requests.get and history from the return value

Has anyone already run into this issue? Thanks!


Solution

  • Google does not give you an HTTP redirect; a 200 OK response is returned, not a 30x redirect:

    >>> import requests
    >>> url = 'https://www.google.com/url?rct=j&sa=t&url=http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/&ct=ga&cd=CAIyGjkyZjE1NGUzMGIwZjRkNGQ6Y29tOmVuOlVT&usg=AFQjCNHrCLmbml7baTXaqySagcuKHp-KHA'
    >>> response = requests.get(url)
    >>> response.url
    u'https://www.google.com/url?rct=j&sa=t&url=http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/&ct=ga&cd=CAIyGjkyZjE1NGUzMGIwZjRkNGQ6Y29tOmVuOlVT&usg=AFQjCNHrCLmbml7baTXaqySagcuKHp-KHA'
    >>> response.text
    u'<script>window.googleJavaScriptRedirect=1</script><script>var m={navigateTo:function(b,a,d){if(b!=a&&b.google){if(b.google.r){b.google.r=0;b.location.href=d;a.location.replace("about:blank");}}else{a.location.replace(d);}}};m.navigateTo(window.parent,window,"http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/");\n</script><noscript><META http-equiv="refresh" content="0;URL=\'http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/\'"></noscript>'
    

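    The response is a piece of HTML and JavaScript that your browser interprets as an instruction to load a new URL. Because no HTTP-level redirect takes place, none of the libraries you tried has anything to follow. You'll have to parse that response to extract the target.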

    String splitting could achieve that:

    >>> response.text.partition("URL='")[-1].rpartition("'\"")[0]
    u'http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/'
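
    The string splitting above depends on the exact quoting of the body. As a slightly more tolerant alternative, here is a minimal Python 3 sketch that reads the target from the `<META http-equiv="refresh">` tag with the standard library's `html.parser`. The names `MetaRefreshParser` and `meta_refresh_target` are hypothetical helpers, not part of any library, and the sketch assumes Google keeps returning a meta refresh tag shaped like the one shown above:

```python
import re
from html.parser import HTMLParser


class MetaRefreshParser(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # Assumed shape of the content attribute: 0;URL='http://example.com/'
        match = re.search(
            r"url\s*=\s*['\"]?([^'\"]+)",
            attrs.get("content") or "",
            re.IGNORECASE,
        )
        if match:
            self.target = match.group(1).strip()


def meta_refresh_target(html):
    """Return the meta refresh target found in html, or None if there is none."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.target
```

    With the response from above, `meta_refresh_target(response.text)` should give the same URL as the string-splitting version, and it returns `None` instead of an empty string when the body contains no meta refresh tag.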
    

    If you assume that the URL in the response body simply mirrors the url parameter in the query string, you can extract it from the link itself, without asking Google to perform the redirect at all:

    try:
        from urllib.parse import parse_qs, urlsplit
    except ImportError:
        # Python 2
        from urlparse import parse_qs, urlsplit
    
    target = parse_qs(urlsplit(url).query)['url'][0]
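
    To apply this to every link in a feed, the lookup can be wrapped in a small function that leaves other links alone. This is only a sketch for Python 3: `google_alert_target` is a hypothetical name, and it assumes that redirect links always use the `/url` path on `google.com` or `www.google.com` and carry the destination in the `url` parameter, as in the example from the question:

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: these are the only hosts that serve the redirect links.
GOOGLE_HOSTS = {"google.com", "www.google.com"}


def google_alert_target(link):
    """Return the destination of a Google redirect link, or link unchanged."""
    parts = urlsplit(link)
    if parts.hostname not in GOOGLE_HOSTS or parts.path != "/url":
        # Not a Google redirect link, so there is nothing to unwrap.
        return link
    targets = parse_qs(parts.query).get("url")
    # Fall back to the original link if the url parameter is missing.
    return targets[0] if targets else link
```

    Note that this never contacts Google, so it cannot detect a case where the url parameter and the real redirect target differ. It would also truncate a destination that contains an unescaped `&`, because `parse_qs` treats that character as a parameter separator.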