This simple Python 3 script:
import urllib.request
host = "scholar.google.com"
link = "/scholar.bib?q=info:K7uZdMSvdQ0J:scholar.google.com/&output=citation&hl=en&as_sdt=1,14&ct=citation&cd=0"
url = "http://" + host + link
filename = "cite0.bib"
print(url)
urllib.request.urlretrieve("http://scholar.google.com" + url, filename)
raises this exception:
Traceback (most recent call last):
File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test2.py", line 8, in <module>
urllib.request.urlretrieve("http://scholar.google.com" + url, filename)
File "C:\Python32\lib\urllib\request.py", line 150, in urlretrieve
return _urlopener.retrieve(url, filename, reporthook, data)
File "C:\Python32\lib\urllib\request.py", line 1569, in retrieve
fp = self.open(url, data)
File "C:\Python32\lib\urllib\request.py", line 1541, in open
raise IOError('socket error', msg).with_traceback(sys.exc_info()[2])
File "C:\Python32\lib\urllib\request.py", line 1537, in open
return getattr(self, name)(url)
File "C:\Python32\lib\urllib\request.py", line 1715, in open_http
return self._open_generic_http(http.client.HTTPConnection, url, data)
File "C:\Python32\lib\urllib\request.py", line 1695, in _open_generic_http
http_conn.request("GET", selector, headers=headers)
File "C:\Python32\lib\http\client.py", line 967, in request
self._send_request(method, url, body, headers)
File "C:\Python32\lib\http\client.py", line 1005, in _send_request
self.endheaders(body)
File "C:\Python32\lib\http\client.py", line 963, in endheaders
self._send_output(message_body)
File "C:\Python32\lib\http\client.py", line 808, in _send_output
self.send(msg)
File "C:\Python32\lib\http\client.py", line 746, in send
self.connect()
File "C:\Python32\lib\http\client.py", line 724, in connect
self.timeout, self.source_address)
File "C:\Python32\lib\socket.py", line 386, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed
I can open the URL that results from the print statement just fine.
What is causing this? I tried changing http:// to http:/// (three slashes), but the same exception is raised.
Here's your problem:
urllib.request.urlretrieve("http://scholar.google.com" + url, filename)
You're adding the http://scholar.google.com part twice (url already starts with http://scholar.google.com). urllib therefore thinks you're asking for a page on the host scholar.google.comhttp. Needless to say, that domain does not exist, so the name lookup fails, which is exactly what the "getaddrinfo failed" error says.
Just request url on its own.
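A minimal sketch of the corrected script, assuming the rest of the question's code stays unchanged. The helper name download_citation is hypothetical and only there so the download is not triggered until you call it.

```python
import urllib.request

host = "scholar.google.com"
link = "/scholar.bib?q=info:K7uZdMSvdQ0J:scholar.google.com/&output=citation&hl=en&as_sdt=1,14&ct=citation&cd=0"
url = "http://" + host + link  # already a complete URL
filename = "cite0.bib"


def download_citation(target_url, target_filename):
    """Hypothetical helper: fetch target_url and save it to target_filename."""
    # Pass the complete URL as-is; do not prepend the host a second time.
    return urllib.request.urlretrieve(target_url, target_filename)

# Call download_citation(url, filename) to perform the actual download.
```

The only change from the question's code is that urlretrieve receives url directly instead of "http://scholar.google.com" + url.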
Handy hint to find this kind of thing faster in the future: when adding a print statement for debugging, be sure to print the actual value you are using in the command you're debugging. You would have found this in approximately two seconds if your print statement had also concatenated the base URL.
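One way to apply that hint, as a rough sketch: build the argument once, store it in a variable, and print that same variable. The names requested, good_host and bad_host are illustrative, and the use of urllib.parse.urlsplit to show the host is an addition here, not something the original script did. Nothing in this block touches the network.

```python
from urllib.parse import urlsplit

host = "scholar.google.com"
link = "/scholar.bib?q=info:K7uZdMSvdQ0J:scholar.google.com/&output=citation&hl=en&as_sdt=1,14&ct=citation&cd=0"
url = "http://" + host + link

# The exact string the failing call passed to urlretrieve.
requested = "http://scholar.google.com" + url
print(requested)  # the doubled prefix is visible at a glance

# Parsing both strings shows which host name would be looked up.
good_host = urlsplit(url).hostname
bad_host = urlsplit(requested).hostname
print(good_host)
print(bad_host)
```

Printing the variable that is actually passed to the call, instead of a similar-looking one, keeps the debug output and the real request from drifting apart.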