I'm trying to download more than 100,000 files from an FTP server in parallel (using threads). I previously tried it with urlretrieve, as answered here, but that gave me the following error: URLError(OSError(24, 'Too many open files')). Apparently this problem is a bug (I cannot find the reference anymore), so I tried using urlopen in combination with shutil and then writing to a file that I could close myself, as described here. This seemed to work fine, but then I got the same error again: URLError(OSError(24, 'Too many open files')). I thought that whenever a write to a file is incomplete or fails, the with statement would cause the file to close itself, but apparently the files still remain open and eventually cause the script to halt.
How can I prevent this error, i.e. make sure that every file gets closed?
import csv
import urllib.request
import shutil
from multiprocessing.dummy import Pool

def url_to_filename(url):
    filename = 'patric_genomes/' + url.split('/')[-1]
    return filename

def download(url):
    url = url.strip()
    try:
        with urllib.request.urlopen(url) as response, open(url_to_filename(url), 'wb') as out_file:
            shutil.copyfileobj(response, out_file)
    except Exception as e:
        return None, e

def build_urls(id_list):
    base_url = 'ftp://some_ftp_server/'
    urls = []
    for some_id in id_list:
        url = base_url + some_id + '/' + some_id + '.fna'
        print(url)
        urls.append(url)
    return urls

if __name__ == "__main__":
    with open('full_data/genome_ids.txt') as inFile:
        reader = csv.DictReader(inFile, delimiter='\t')
        ids = {row['some_id'] for row in reader}
    urls = build_urls(ids)
    p = Pool(100)
    print(p.map(download, urls))
You may try to use contextlib to close your file, like this:
import contextlib
[ ... ]
with contextlib.closing(urllib.request.urlopen(url)) as response, open(url_to_filename(url), 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
[ ... ]
According to the docs:
contextlib.closing(thing)
Return a context manager that closes thing upon completion of the block. [ ... ] without needing to explicitly close page. Even if an error occurs, page.close() will be called when the with block is exited.
***
A workaround would be to raise the open files limit on your Linux OS. Check your current hard limit for open files:
ulimit -Hn
Add the following line to your /etc/sysctl.conf file:
fs.file-max = <number>
where <number> is the new upper limit of open files you want to set.
Save and close the file, then run
sysctl -p
so that the changes take effect.
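If you prefer to keep the change scoped to the script itself, the per-process limit can also be inspected and raised (up to the hard limit) from within Python using the standard resource module; a minimal sketch:
import resource

# Current soft and hard limits on the number of open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft limit:', soft, 'hard limit:', hard)

# Raise the soft limit to the hard limit; only root may raise the hard limit itself.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))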