If you create a Google Alert as an RSS feed (rather than having it sent automatically to your e-mail address), it contains links like this one: https://www.google.com/url?rct=j&sa=t&url=http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/&ct=ga&cd=CAIyGjkyZjE1NGUzMGIwZjRkNGQ6Y29tOmVuOlVT&usg=AFQjCNHrCLmbml7baTXaqySagcuKHp-KHA.
This link is obviously a redirect (just try it and you'll end up here: http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/), but I cannot get the final URL with Python (other than by stripping the beginning of the link, which is quite ugly).
So far I've tried the urllib2, httplib2 and requests packages.
Has anyone run into this issue before? Thanks!
Google does not give you an HTTP redirect; it returns a 200 OK response, not a 30x redirect:
>>> import requests
>>> url = 'https://www.google.com/url?rct=j&sa=t&url=http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/&ct=ga&cd=CAIyGjkyZjE1NGUzMGIwZjRkNGQ6Y29tOmVuOlVT&usg=AFQjCNHrCLmbml7baTXaqySagcuKHp-KHA'
>>> response = requests.get(url)
>>> response.url
u'https://www.google.com/url?rct=j&sa=t&url=http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/&ct=ga&cd=CAIyGjkyZjE1NGUzMGIwZjRkNGQ6Y29tOmVuOlVT&usg=AFQjCNHrCLmbml7baTXaqySagcuKHp-KHA'
>>> response.text
u'<script>window.googleJavaScriptRedirect=1</script><script>var m={navigateTo:function(b,a,d){if(b!=a&&b.google){if(b.google.r){b.google.r=0;b.location.href=d;a.location.replace("about:blank");}}else{a.location.replace(d);}}};m.navigateTo(window.parent,window,"http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/");\n</script><noscript><META http-equiv="refresh" content="0;URL=\'http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/\'"></noscript>'
The response is a piece of HTML and JavaScript that your browser will interpret as loading a new URL. You'll have to parse that response to extract the target.
String splitting could achieve that:
>>> response.text.partition("URL='")[-1].rpartition("'\"")[0]
u'http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/'
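If the string splitting feels too brittle, a regular expression is one alternative. The following is only a sketch: the helper name `extract_meta_refresh` is made up for illustration, and the pattern assumes the `<noscript>` meta refresh markup shown in the response above, which Google could change at any time.

```python
import re

# Assumption: the body contains a meta refresh of the form
#   <META http-equiv="refresh" content="0;URL='http://...'">
# as in the response shown above.
META_REFRESH = re.compile(
    r"""content=["']\d+;\s*URL=['"]?([^'">]+)""",
    re.IGNORECASE,
)


def extract_meta_refresh(html):
    """Return the meta refresh target found in html, or None if there is none."""
    match = META_REFRESH.search(html)
    return match.group(1) if match else None
```

You would call it as `extract_meta_refresh(response.text)`; a `None` result means the body did not look like the redirect page above.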
If we assume that the URL parameter in the body is just a direct reflection of the url parameter in the query string, then you can extract it from there instead, and you don't even have to ask Google to execute the redirect:
try:
    from urllib.parse import parse_qs, urlsplit
except ImportError:
    # Python 2
    from urlparse import parse_qs, urlsplit

target = parse_qs(urlsplit(url).query)['url'][0]
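When processing a whole feed, it may help to wrap this in a small function that leaves other links untouched. This is a minimal sketch under the same assumption as above (the `url` query parameter is the real target); the function name `google_redirect_target` and the host check are illustrative choices, not anything Google documents.

```python
try:
    from urllib.parse import parse_qs, urlsplit
except ImportError:
    # Python 2
    from urlparse import parse_qs, urlsplit


def google_redirect_target(link):
    """Return the 'url' query parameter of a Google redirect link.

    Links that are not of the google.com/url?... form, or that carry no
    'url' parameter, are returned unchanged.
    """
    parts = urlsplit(link)
    # Assumption: redirect links always use this host and path.
    if parts.hostname in ('google.com', 'www.google.com') and parts.path == '/url':
        values = parse_qs(parts.query).get('url')
        if values:
            return values[0]
    return link
```

Note that `parse_qs` decodes percent-encoding, so a target that carries its own query string should come back decoded as long as it was properly encoded inside the Google link.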