We have a very large MongoDB collection of documents with a set of pre-defined fields that may or may not hold a value. We need to gather the fill-rates of those fields, so we wrote a script that iterates over all documents and counts the fill-rate for each field; the problem is that it takes a long time to process all documents.
Is there a way to use db.collection.aggregate
or db.collection.mapReduce
to run such a script server-side?
Would it yield significant performance improvements?
Will it slow down other usages of that collection (e.g. holding a major lock)?
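To make the question concrete, here's roughly what I'd hope to express server-side. This is an untested sketch, and fieldA/fieldB are placeholders for our real field names:

collection.aggregate([
    { $match: query },
    { $group: {
        _id: null,
        total: { $sum: 1 },
        // count a document for a field only when the field is present and not null
        fieldA: { $sum: { $cond: [ { $eq: [ { $ifNull: ['$fieldA', null] }, null ] }, 0, 1 ] } },
        fieldB: { $sum: { $cond: [ { $eq: [ { $ifNull: ['$fieldB', null] }, null ] }, 0, 1 ] } }
    } }
], function (error, results) {
    // results[0] would hold the total and the per-field filled counts
});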
Answering my own question: I was able to migrate my script, which used a cursor to scan the whole collection, to a map-reduce query. Running on a sample of the collection, it seems to be at least twice as fast with map-reduce.
Here's how the old script worked (in node.js):
var cursor = collection.find(query, projection).sort({_id: 1}).limit(limit);
// walk the cursor one document at a time; processDoc calls next()
// again once it's done with the current document
var next = function() {
    cursor.nextObject(function(err, doc) {
        processDoc(doc, next);
    });
};
next();
and this is the new script:
collection.mapReduce(
    function () {
        // map: runs server-side once per matched document
        var processDoc = function(doc) {
            ...
        };
        processDoc(this);
    },
    function (key, values) {
        // reduce: sum the 1s emitted for each field name
        return Array.sum(values);
    },
    {
        query: query,
        out: {inline: 1}
    },
    function (error, results) {
        // print results
    }
);
processDoc stayed basically the same, but instead of incrementing a counter on a global stats object, I do:
emit(field_name, 1);
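For illustration, the map function ends up looking roughly like this (FIELDS and the field names are placeholders, not our real schema):

// sketch of the map function; FIELDS stands in for the list of
// pre-defined field names we track
function () {
    var FIELDS = ['fieldA', 'fieldB', 'fieldC'];
    var doc = this;
    FIELDS.forEach(function (field_name) {
        // count a field as filled when it's present and not null
        if (doc[field_name] !== undefined && doc[field_name] !== null) {
            emit(field_name, 1);
        }
    });
}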
Running both versions on a sample of 100k documents, the old script took 20 seconds and the new one took 8.
Some notes:

- The limit option doesn't work on sharded collections; I had to query for _id : { $gte: ..., $lte: ... } to create the sample size I needed (see the sketch below).
- jsMode : true doesn't work on sharded collections either (it might have improved performance even more); running the map-reduce manually on each shard might be a way to still gain that feature.
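For example, the sampling query looked something like this (the time bounds are placeholders; any pair of ObjectId bounds that brackets the desired sample would do):

var ObjectID = require('mongodb').ObjectID;

// limit isn't honored on sharded collections, so sample by an _id range instead;
// the dates below are placeholders for whatever range defines the sample
var query = {
    _id: {
        $gte: ObjectID.createFromTime(new Date('2014-01-01') / 1000),
        $lte: ObjectID.createFromTime(new Date('2014-01-02') / 1000)
    }
};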