
Mongoose Cursors with Many Documents and Load


We've been using Mongoose in Node.js/Express for some time, and one thing we're not clear about is what happens when you run a find query that returns a large result set of documents. For example, let's say you want to iterate through all your users to do some low-priority background processing.

let cursor = User.find({}).cursor();
cursor.on('data',function(user) {
   // do some processing here 
});

My understanding is that cursor.on('data') doesn't block. Therefore, if you have, say, 100,000 users, you would overwhelm the system trying to process all of them nearly simultaneously. There doesn't seem to be a "next" or similar method to regulate how quickly we consume the documents.

How do you process large document result sets?


Solution

  • Mongoose actually does have a .next() method for cursors! Check out the Mongoose documentation. Here is a snapshot of the Example section as of this answer:

    // There are 2 ways to use a cursor. First, as a stream:
    Thing.
      find({ name: /^hello/ }).
      cursor().
      on('data', function(doc) { console.log(doc); }).
      on('end', function() { console.log('Done!'); });
    
    // Or you can use `.next()` to manually get the next doc in the stream.
    // `.next()` returns a promise, so you can use promises or callbacks.
    var cursor = Thing.find({ name: /^hello/ }).cursor();
    cursor.next(function(error, doc) {
      console.log(doc);
    });
    
    // Because `.next()` returns a promise, you can use co
    // to easily iterate through all documents without loading them
    // all into memory.
    co(function*() {
      const cursor = Thing.find({ name: /^hello/ }).cursor();
      for (let doc = yield cursor.next(); doc != null; doc = yield cursor.next()) {
        console.log(doc);
      }
    });
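
    In recent Mongoose releases the cursor also exposes an .eachAsync() helper (and supports async iteration), which pulls one document at a time and waits for your callback to finish before fetching the next, so memory stays bounded. A minimal sketch, assuming a User model as in your question and a hypothetical doBackgroundWork() helper (check your Mongoose version's docs for availability):

    // Process users one at a time; the next document is not requested
    // until the promise returned by the callback resolves.
    const cursor = User.find({}).cursor();

    cursor.eachAsync(function(user) {
      // doBackgroundWork() is a placeholder for your own processing
      return doBackgroundWork(user);
    }).then(function() {
      console.log('All users processed');
    });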
    

    With the above in mind, your data set could still grow large enough to be difficult to work with. It may be worth considering MongoDB's aggregation pipeline to simplify the processing of large data sets. If you use a replica set, you can even set a readPreference to direct your large aggregation queries to secondary nodes, ensuring that the primary node's performance remains largely unaffected. This shifts the burden from your application's critical path to less-critical secondary database nodes.
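
    As a rough illustration (the User model, pipeline stages, and field names here are placeholders, not taken from your code), a Mongoose aggregation routed to a secondary might look something like this:

    // Run an aggregation against a secondary member of the replica set
    // so the primary stays free for normal traffic.
    User.aggregate([
      { $match: { active: true } },
      { $group: { _id: '$plan', count: { $sum: 1 } } }
    ])
      .read('secondaryPreferred') // requires a replica set
      .exec(function(err, results) {
        if (err) return console.error(err);
        console.log(results);
      });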

    If your data set is particularly large and you perform the same calculations on the same documents repeatedly, you could even consider storing precomputed aggregation results in a "base" document and then applying only the unprocessed documents on top of that "base" as a "delta"; in other words, you reduce each run to "every change since the last saved computation".
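
    A hedged sketch of that "base plus delta" idea (the Stats model, field names, and counting logic are assumptions for illustration only):

    // Fold only the documents created since the last run into a stored
    // "base" document, then record when this run happened.
    async function updateUserTotals() {
      const stats = await Stats.findOne({ _id: 'userTotals' });
      const since = stats ? stats.lastRunAt : new Date(0);

      let newSignups = 0;
      const cursor = User.find({ createdAt: { $gt: since } }).cursor();
      await cursor.eachAsync(function(user) {
        newSignups += 1; // the "delta" since the last saved computation
      });

      await Stats.updateOne(
        { _id: 'userTotals' },
        { $inc: { total: newSignups }, $set: { lastRunAt: new Date() } },
        { upsert: true }
      );
    }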

    Finally, there's also the option of load balancing. You could have multiple application servers for processing and have a load balancer distributing requests roughly evenly between them to prevent any one server from becoming overwhelmed.


    There are quite a few options available for avoiding a scenario where your system becomes overwhelmed by all of the data processing. Which strategies you should employ will depend largely on your particular use case. In this case, however, the question seems hypothetical, so the additional strategies noted above probably aren't things you need to concern yourself with yet. For now, stick with the .next() calls and you should be fine.