I am writing a program that should process many small files, say thousands or even millions. I've been testing that part on 500k files. The first step was just to iterate a directory tree that has around 45k directories in it (including subdirectories of subdirectories, etc.) and 500k small files. Traversing all directories and files, including getting each file's size and calculating the total size, takes about 6 seconds. Now, if I try to open each file during the traversal and close it immediately, it looks like it never stops. In fact, it takes way too long (hours...). Since I'm doing this on Windows, I tried opening the files with CreateFileW, _wfopen and _wopen. I didn't read or write anything, although in the final implementation I'll need read access only. However, I didn't see a noticeable improvement in any of the attempts.
I wonder whether there's a more efficient way to open the files with any of the available functions (C, C++ or the Windows API), or whether the only faster approach is to read the MFT and the disk blocks directly, which I am trying to avoid.
Update: The application I am working on makes backup snapshots with versioning, so it also does incremental backups. The 500k-file test is run on a huge source-code repository in order to do versioning, something like an SCM. So the files are not all in one directory; there are around 45k directories as well (mentioned above).
So the proposed solution of zipping the files doesn't help, because when the backup runs, that's when all the files are accessed. Hence I'd see no benefit from it, and it would even incur some performance cost.
What you are trying to do is intrinsically difficult for any operating system to do efficiently. 45,000 subdirectories requires a lot of disk access no matter how it is sliced.
Any file over about 1,000 bytes is "big" as far as NTFS is concerned. If there were a way to make most data files less than about 900 bytes, you could realize a major efficiency by having the file data stored resident inside the MFT record itself. Then it would be no more expensive to obtain the data than it is to obtain the file's timestamps or size.
I doubt there is any way to optimize the program's parameters, process options, or even the operating system's tuning parameters to make the application work well. You are faced with a multi-hour operation unless you can re-architect it in a radically different way.
One strategy would be to distribute the files across multiple computers (probably thousands of them) and run a sub-application on each that processes its local files, feeding the results to a master application.
Another strategy would be to re-architect all the files into a few larger files, like big .zip files as suggested by @felicepollano, effectively virtualizing your set of files. Random access within a 4000 GB file is inherently a far more efficient use of resources than accessing 4 billion 1,000-byte files. Moving all the data into a suitable database manager (MySQL, SQL Server, etc.) would also accomplish this, and might provide other benefits like easy searches and an easy archival strategy.