Tags: python, windows, filesystems, ntfs

Windows 7 directory with over a million small (30 kB) files: dramatic performance decrease


I have run into a performance problem with my scripts while generating and using a large quantity of small files.

I have two directories on my disk (same behavior on HDD and SSD): the first holds ~10_000 input files and the second receives ~1_300_000 output files. I wrote a script that processes the input files and generates the output using Python's multiprocessing library.

The first 400_000-600_000 output files (I'm not sure exactly where the threshold is) are generated at a constant pace, with all 8 CPU cores at 100%. Then it gets much worse: by the time the directory holds around 1_000_000 files, throughput has dropped roughly 20-fold and core usage falls to 1-3%.
I worked around the issue by creating a second output directory and writing the second half of the output files there (I needed a quick hotfix).
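Roughly, the write pattern looks like the sketch below; the paths, task generation and `process_one` step are simplified placeholders, not the real processing code:

```python
# Simplified sketch of the write pattern: a pool of workers, each writing
# one small (~30 kB) file per task into a single output directory.
import os
from multiprocessing import Pool

INPUT_DIR = "input"    # ~10_000 input files (illustrative path)
OUTPUT_DIR = "output"  # ends up holding ~1_300_000 small files

def process_one(task):
    # Each task produces one small output file.
    name, payload = task
    out_path = os.path.join(OUTPUT_DIR, name + ".out")
    with open(out_path, "wb") as f:
        f.write(payload)

def make_tasks():
    # Placeholder: the real script derives these from the input files;
    # here we just fabricate ~30 kB payloads.
    for i in range(1_300_000):
        yield (f"record_{i:07d}", b"x" * 30_000)

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    with Pool(8) as pool:
        pool.map(process_one, make_tasks(), chunksize=100)
```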

Now, I have two questions:
1) How is creating a new file and writing to it handled by Python on Windows? What is the bottleneck here? (My guess is that Windows checks whether the file already exists in the directory before writing to it.)
2) What is a more elegant way (than splitting into directories) to handle this issue correctly?


Solution

  • In case anyone runs into the same problem: the bottleneck turned out to be the file-lookup time in crowded directories.

    I resolved the issue by splitting the files into separate directories, grouped by one parameter that is evenly distributed over 20 different values. Though now I would do it differently.

    I recommend solving a similar issue with the shelve module from Python's standard library. A shelf is a single file in the filesystem that you can access like a dictionary and put pickles inside. Just like in real life :) A short example follows below.
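A minimal sketch of the shelve approach, assuming the output records can be pickled; the key scheme and record contents are illustrative:

```python
# Store many small records in a single shelve file instead of
# one file per record. Keys and payloads are illustrative.
import shelve

with shelve.open("output_store") as db:   # creates output_store.* on disk
    for i in range(1_000):                # would be ~1_300_000 in practice
        db[f"record_{i:07d}"] = {"data": b"x" * 30_000}

# Later, read any record back like a dictionary lookup:
with shelve.open("output_store", flag="r") as db:
    record = db["record_0000042"]
    print(len(record["data"]))            # -> 30000
```

One caveat: shelve does not support concurrent writes from multiple processes, so with a multiprocessing setup you would typically have the workers send their results back to the parent process and let that single process write them into the shelf.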