Search code examples
mongodbmapreducedatabase-performance

MongoDB - how to get fields fill-rates as quickly as possible?


We have a very big MongoDB collection of documents with some pre-defined fields that can either have a value or not.

We need to gather fill-rates of those fields, we wrote a script that goes over all documents and counts fill-rates for each, problem is it takes a long time to process all documents.

Is there a way to use db.collection.aggregate or db.collection.mapReduce to run such a script server-side? Should it have significant performance improvements? Will it slow down other usages of that collection (e.g. holding a major lock)?


Solution

  • Answering my own question, I was able to migrate my script using a cursor to scan the whole collection, to a map-reduce query, and running on a sample of the collection it seems it's at least twice as fast using the map-reduce.

    Here's how the old script worked (in node.js):

    var cursor = collection.find(query, projection).sort({_id: 1}).limit(limit);
    var next = function() {
       cursor.nextObject(function(err, doc) {
          processDoc(doc, next);
       });
    };
    next();
    

    and this is the new script:

    collection.mapReduce(
        function () {
            var processDoc = function(doc) {
                ...
            };
            processDoc(this);
        },
        function (key, values) {
          return Array.sum(values)
        },
        {
          query : query,
          out: {inline: 1}
        },
        function (error, results) {
          // print results
        }
    );
    

    processDoc stayed basically the same, but instead of incrementing a counter on a global stats object, I do:

    emit(field_name, 1);
    

    running old and new on a sample of 100k, old took 20 seconds, new took 8.

    some notes:

    1. map-reduce's limit option doesn't work on sharded collections, I had to query for _id : { $gte, $lte} to create the sample size needed.
    2. map-reduce's performance boost option: jsMode : true doesn't work on sharded collections as well (might have improve performance even more), it might work to run it manually on each shard to gain that feature.