I am using the urllib2 module in Python 2.7 (running in Spyder 3.0) to batch-download text files whose URLs are listed, one per line, in a text file:
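import sys
import urllib2

reload(sys)
sys.setdefaultencoding('utf-8')

with open('ocean_not_templated_url.txt', 'r') as text:
    lines = text.readlines()

for line in lines:
    # Strip stray characters and whitespace before requesting the URL.
    url = urllib2.urlopen(line.strip('ï \xa0\t\n\r\v'))
    # Build a file name from the URL by replacing path characters.
    with open(line.strip('\n\r\t ').replace('/', '!').replace(':', '~'), 'wb') as out:
        for d in url:
            out.write(d)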
I've already found a bunch of weird characters in the URLs, which I now strip. However, the script still fails when it is nearly 90% complete, giving the following error:
I thought the culprit was a non-breaking space (\xa0 in the code), but stripping it doesn't help. Any ideas?
That's an odd URL!
Specify the network protocol. Try prefixing the URL with http:// and the domain name if the file lives on the web.
Files always reside somewhere, in some server's directory, or locally on your system. So there must be a network path to such files, for example:
http://127.0.0.1/folder1/samuel/file1.txt
The same example, with localhost used as the (usual) alias for 127.0.0.1:
http://localhost/folder1/samuel/file1.txt
That might solve the problem. Just think about where your file exists and how it should be addressed...
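To find entries in the list that lack a protocol, something like the following might help. It is only a sketch: has_scheme is a hypothetical helper name, not part of urllib2, and it simply checks whether the standard library's urlparse reports a scheme for the string.

```python
try:
    from urlparse import urlparse        # Python 2.7, as used in the question
except ImportError:
    from urllib.parse import urlparse    # Python 3

def has_scheme(url):
    """Return True if the URL names a protocol such as http or ftp.

    Hypothetical helper for illustration only.
    """
    return bool(urlparse(url).scheme)
```

Any line for which has_scheme returns False (including an empty string) is one that urlopen would not know how to fetch.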
Update:
I experimented quite a bit on this. I think I know why that error is raised! :D
My guess is that the file storing the URLs has a sneaky empty line near the end. I say near the end because you mentioned that the script gets through about 90% of the list before failing. Once stripped, that line is an empty string, so urllib2's get_type method (on the Request object) cannot work out a scheme for it and raises
unknown url type:
I think that's the problem! Remove that empty line from the file ocean_not_templated_url.txt and try it out!
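As an alternative to editing the file by hand, the loop could skip blank lines itself. This is a rough sketch, assuming the only problem is lines that are empty after stripping; iter_urls and its junk parameter are hypothetical names, and the default set of stripped characters is an assumption based on the ones in your script.

```python
def iter_urls(lines, junk='\xa0 \t\n\r\v'):
    """Yield each line with surrounding junk stripped, skipping empty results.

    Hypothetical helper: `junk` is an assumed set of characters to strip.
    """
    for line in lines:
        url = line.strip(junk)
        if url:  # an empty string here is what triggers "unknown url type:"
            yield url
```

In your script, iterating over iter_urls(lines) instead of lines and passing each result to urllib2.urlopen should mean an empty string never reaches it.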
Just check and let me know! :P