
Operating on a collection of files and reporting aggregate results with non-blocking IO


I would like to perform some arbitrarily expensive work on an arbitrarily large set of files. I would like to report progress in real-time and then display results after all files have been processed. If there are no files that match my expression, I'd like to throw an error.

Imagine writing a test framework that loads up all of your test files, executes them (in no particular order), reports on progress in real-time, and then displays aggregate results after all tests have been completed.

Writing this code in a blocking language (like Ruby, for example) is extremely straightforward.

As it turns out, I'm having trouble performing this seemingly simple task in node, while also truly taking advantage of asynchronous, event-based IO.

My first design was to perform each step serially.

  1. Load up all of the files, creating a collection of files to process
  2. Process each file in the collection
  3. Report the results when all files have been processed
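
In code, the serial shape looks roughly like this, where findAllFiles, processFile, and reportResults are placeholders for the three steps above:

// Placeholder names: findAllFiles gathers every matching path up front,
// processFile does the expensive work, and reportResults prints the summary.
findAllFiles('test/', /_test.js/, function(err, files) {
  if (err) throw err;
  var results = files.map(function(file) {
    return processFile(file); // none of this starts until the full scan is done
  });
  reportResults(results);     // step 3: report after everything is processed
});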

This approach does work, but doesn't seem quite right to me since it causes the more computationally expensive portion of my program to wait for all of the file IO to complete. Isn't this the kind of waiting that Node was designed to avoid?

My second design was to process each file as it was asynchronously found on disk. For the sake of argument, let's imagine a method that looks something like this:

function eachFileMatching(path, expression, callback) {
  // Recursively and asynchronously traverse the file system,
  // calling callback every time a file name matches expression.
}

And a consumer of this method that looks something like this:

eachFileMatching('test/', /_test.js/, function(err, testFile) {
  // read and process the content of testFile
});

While this design feels like a very 'node' way of working with IO, it suffers from two major problems (at least in my presumably erroneous implementation):

  1. I have no idea when all of the files have been processed, so I don't know when to assemble and publish results.
  2. Because the file reads are non-blocking and recursive, I'm struggling with how to know if no files were found at all.

I'm hoping that I'm simply doing something wrong, and that there is some reasonably simple strategy that other folks use to make the second approach work.
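
For concreteness, the consumer-side API I'm hoping for would look something like this, where the extra completion callback is my guess at what's missing (hypothetical, not working code):

eachFileMatching('test/', /_test.js/, function(err, testFile) {
  if (err) throw err;
  // read and process the content of testFile as it is found
}, function(err, matchCount) {
  // called once, after every match has been handled
  if (err) throw err; // e.g. no files matched the expression
  console.log('Processed ' + matchCount + ' test files');
});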

Even though this example uses a test framework, I have a variety of other projects that bump up against this exact same problem, and I imagine anyone writing a reasonably sophisticated application that accesses the file system in node would too.


Solution

  • As it turns out, the smallest working solution that I've been able to build is much more complicated than I hoped.

    Following is code that works for me. It can probably be cleaned up or made slightly more readable here and there, and I'm not interested in feedback like that.

    If there is a significantly different way to solve this problem that is simpler and/or more efficient, I'm very interested in hearing it. It really surprises me that the solution to this seemingly simple requirement would require such a large amount of code, but perhaps that's why someone invented blocking IO?

    The complexity is really in the desire to meet all of the following requirements:

    • Handle files as they are found
    • Know when the search is complete
    • Know if no files are found

    Here's the code:

    var fs = require('fs');

    /**
     * Call fileHandler with the file name and file Stat for each file found inside
     * of the provided directory.
     *
     * Call the optionally provided completeHandler with an array of files (mingled
     * with directories) and an array of Stat objects (one for each of the found
     * files).
     *
     * Following is an example of a simple usage:
     *
     *   eachFileOrDirectory('test/', function(err, file, stat) {
     *     if (err) throw err;
     *     if (!stat.isDirectory()) {
     *       console.log(">> Found file: " + file);
     *     }
     *   });
     *
     * Following is an example that waits for all files and directories to be 
     * scanned and then uses the entire result to do something:
     *
     *   eachFileOrDirectory('test/', null, function(err, files, stats) {
     *     if (err) throw err;
     *     var len = files.length;
     *     for (var i = 0; i < len; i++) {
     *       if (!stats[i].isDirectory()) {
     *         console.log(">> Found file: " + files[i]);
     *       }
     *     }
     *   });
     */
    var eachFileOrDirectory = function(directory, fileHandler, completeHandler) {
      var filesToCheck = 0;
      var checkedFiles = [];
      var checkedStats = [];
    
      directory = (directory) ? directory : './';
    
      var fullFilePath = function(dir, file) {
        return dir.replace(/\/$/, '') + '/' + file;
      };
    
      var checkComplete = function() {
        if (filesToCheck == 0 && completeHandler) {
          completeHandler(null, checkedFiles, checkedStats);
        }
      };
    
      var onFileOrDirectory = function(fileOrDirectory) {
        filesToCheck++;
        fs.stat(fileOrDirectory, function(err, stat) {
          filesToCheck--;
          if (err) return fileHandler && fileHandler(err);
          checkedFiles.push(fileOrDirectory);
          checkedStats.push(stat);
          // fileHandler is optional (e.g. when only completeHandler is provided)
          if (fileHandler) fileHandler(null, fileOrDirectory, stat);
          if (stat.isDirectory()) {
            onDirectory(fileOrDirectory);
          }
          checkComplete();
        });
      };
    
      var onDirectory = function(dir) {
        filesToCheck++;
        fs.readdir(dir, function(err, files) {
          filesToCheck--;
          if (err) return fileHandler && fileHandler(err);
          files.forEach(function(file, index) {
            file = fullFilePath(dir, file);
            onFileOrDirectory(file);
          });
          checkComplete();
        });
      };
    
      onFileOrDirectory(directory);
    };
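
    To tie this back to the original question, the helper can be wrapped to get the eachFileMatching behaviour, including an error when nothing matches. The wrapper below is only a sketch (it assumes the expression is tested against the full file path):

    var eachFileMatching = function(directory, expression, fileHandler, completeHandler) {
      eachFileOrDirectory(directory, function(err, file, stat) {
        // Handle matching files as they are found.
        if (err) return fileHandler(err);
        if (!stat.isDirectory() && expression.test(file)) {
          fileHandler(null, file, stat);
        }
      }, function(err, files, stats) {
        // The search is complete; check whether anything matched at all.
        if (err) return completeHandler(err);
        var matched = files.filter(function(file, i) {
          return !stats[i].isDirectory() && expression.test(file);
        });
        if (matched.length === 0) {
          return completeHandler(new Error('No files matched: ' + expression));
        }
        completeHandler(null, matched);
      });
    };

    eachFileMatching('test/', /_test.js/, function(err, testFile) {
      if (err) throw err;
      console.log(">> Processing: " + testFile);
    }, function(err, matches) {
      if (err) throw err; // includes the "no files matched" case
      console.log(">> Done: " + matches.length + " test files processed.");
    });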