I have the following loop:
for fileName in fileList:
    with open(fileName) as f:
        txt = f.read()
    analyze(txt)
fileList is a list of the names of more than 1 million small files. Empirically, I have found that the call to open(fileName) takes more than 90% of the loop's running time. What would you do to optimize this loop? This is a "software only" question; buying new hardware is not an option.
Some information about this file collection:
Each file name is a 9-13 digit ID. The files are arranged in subfolders according to the first 4 digits of the ID. The files are stored on an NTFS disk, and I would rather not change the disk format for reasons I won't get into, unless someone here has a strong belief that such a change will make a huge difference.
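For illustration, a path under that layout might be built like this (the root directory name and helper are placeholders of mine, not part of the original code):

import os

def path_for(file_id, root="data"):
    # Hypothetical helper: the subfolder is the first 4 digits of the ID.
    return os.path.join(root, file_id[:4], file_id)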
Thank you all for the answers.
My solution was to pass over all the files once, parse them, and put the results in an SQLite database. Now the analyses that I perform on the data (selecting several entries, doing the math) take only seconds. As already said, the reading part took about 90% of the time, so parsing the XML files in advance had little effect on performance compared to the effect of not having to read the actual files from the disk.
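A minimal sketch of that approach, assuming the parsed values are simple key/value pairs; the XML layout, table schema, and function names below are my assumptions, not taken from the original code:

import sqlite3
import xml.etree.ElementTree as ET

def build_database(fileList, dbPath="files.db"):
    # One-time pass: read and parse every file, store the results in SQLite.
    conn = sqlite3.connect(dbPath)
    conn.execute("CREATE TABLE IF NOT EXISTS entries (file_id TEXT, key TEXT, value REAL)")
    with conn:
        for fileName in fileList:
            root = ET.parse(fileName).getroot()
            # Assumed XML layout: a flat list of <entry key="..." value="..."/> elements.
            rows = [(fileName, e.get("key"), float(e.get("value")))
                    for e in root.iter("entry")]
            conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)
    conn.close()

def query_sum(dbPath, key):
    # Later analyses query the database instead of reopening a million files.
    conn = sqlite3.connect(dbPath)
    (total,) = conn.execute(
        "SELECT SUM(value) FROM entries WHERE key = ?", (key,)).fetchone()
    conn.close()
    return total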
If opening and closing files is taking most of your time, a good idea would be to use a database or data store for your storage rather than a collection of flat files.
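One possible reading of that suggestion is to copy each file's raw contents into a single SQLite file once, so later reads are lookups in one open database rather than a million open() calls. A sketch under those assumptions (table name and ID handling are mine):

import os
import sqlite3

def pack_files(fileList, dbPath="filestore.db"):
    # One-time pass: store each file's bytes under its ID in one SQLite file.
    conn = sqlite3.connect(dbPath)
    conn.execute("CREATE TABLE IF NOT EXISTS files (id TEXT PRIMARY KEY, body BLOB)")
    with conn:
        for fileName in fileList:
            with open(fileName, "rb") as f:
                conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
                             (os.path.basename(fileName), f.read()))
    conn.close()

def read_file(conn, file_id):
    # Replacement for reading fileName from disk in the analysis loop.
    (body,) = conn.execute("SELECT body FROM files WHERE id = ?", (file_id,)).fetchone()
    return body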