Search code examples
c++performancerecursiondirectoryntfs

fastest way to crawl recursive ntfs directories in C++


I have written a small crawler to scan and resort directory structures.

It based on dirent(which is a small wrapper around FindNextFileA) In my first benchmarks it is surprisingy slow:

around 123473ms for 4500 files(thinkpad t60p local samsung 320 GB 2.5" HD). 121481 files found in 123473 milliseconds Is this speed normal?

This is my code:

int testPrintDir(std::string  strDir, std::string strPattern="*", bool recurse=true){
  struct dirent *ent;
  DIR *dir;
  dir = opendir (strDir.c_str());
  int retVal = 0;
  if (dir != NULL) {
    while ((ent = readdir (dir)) != NULL) {
      if (strcmp(ent->d_name, ".") !=0 &&  strcmp(ent->d_name, "..") !=0){
        std::string strFullName = strDir +"\\"+std::string(ent->d_name);
        std::string strType = "N/A";
        bool isDir = (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) !=0;
        strType = (isDir)?"DIR":"FILE";                 
        if ((!isDir)){
             //printf ("%s <%s>\n", strFullName.c_str(),strType.c_str());//ent->d_name);
          retVal++;
        }   
        if (isDir && recurse){
             retVal += testPrintDir(strFullName, strPattern, recurse);
        }
      }
    }
    closedir (dir);
    return retVal;
  } else {
    /* could not open directory */
    perror ("DIR NOT FOUND!");
    return -1;
  }
}

Solution

  • There are some circumstances where such a speed is normal yes. First, using FindFirstFileA instead of FindFirstFileW is going to incur overhead for the conversion from UTF-16 to ANSI. Second, if you are going through directories that have not yet been accessed by the operating system, you will incur at least one seek penalty (about 16ms for most consumer hard drives), limiting your enumeration to well under 100 directory checks per second. This will get worse if the Master File Table on the given drive is badly fragmented.

    Regarding number of files, it's going to depend more upon the number of files per directory than the number of files themselves.