Tags: mongodb, aggregation-framework, document-versioning

MongoDB Aggregation query running very slow


We version most of our collections in MongoDB. The selected versioning mechanism is as follows:

{  "docId" : 174, "v" : 1,  "attr1": 165 }   /*version 1 */
{  "docId" : 174, "v" : 2,  "attr1": 165, "attr2": "A-1" } 
{  "docId" : 174, "v" : 3,  "attr1": 184, "attr2" : "A-1" }

So, when we perform our queries, we always need to use the aggregation framework in this way to make sure we get the latest versions of our objects:

db.docs.aggregate( [
    { "$sort": { "docId": -1, "v": -1 } },
    { "$group": { "_id": "$docId", "doc": { "$first": "$$ROOT" } } },
    { "$match": { <query> } }
] );

The problem with this approach is that once you have done your grouping, you have a set of data in memory that has nothing to do with your collection, and thus your indexes cannot be used.

As a result, the more documents your collection has, the slower the query gets.
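To illustrate: even with a compound index that supports the leading $sort (which at least avoids sorting the whole collection in memory), the $match that runs after $group cannot use it. A sketch, assuming the collection is called docs and using a placeholder predicate instead of <query>:

db.docs.createIndex({ "docId": -1, "v": -1 });

db.docs.aggregate(
    [
        { "$sort": { "docId": -1, "v": -1 } },
        { "$group": { "_id": "$docId", "doc": { "$first": "$$ROOT" } } },
        { "$match": { "doc.attr1": 184 } }    // placeholder for <query>
    ],
    { "explain": true }    // the initial cursor can show an IXSCAN for the $sort, but the final $match still filters grouped data in memory
);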

Is there any way to speed this up?

If not, I will consider moving to one of the approaches described in this good post: http://www.askasya.com/post/trackversions/


Solution

  • Just to complete this question: we went with option 3, one collection to keep the latest versions and one collection to keep the historical ones. It is introduced here: http://www.askasya.com/post/trackversions/ and some further description (with some nice code snippets) can be found at http://www.askasya.com/post/revisitversions/.

    It has been running in production for 6 months now. So far so good. The former approach meant we were always using the aggregation framework, which stops using indexes as soon as you reshape the original documents (with $group, $project, ...) because the pipeline output no longer matches the original collection. This was making our performance terrible as the data grew.

    With the new approach, though, the problem is gone. 90% of our queries go against the latest data, which means we target a collection with a simple ObjectId as identifier and no longer need the aggregation framework, just regular finds.

    Our queries against historical data always include id and version, so by indexing these (we put both into _id, which gives us the index out of the box) reads against those collections are equally fast; a sketch of the pattern follows below. This is a point not to overlook, though: the read patterns of your application are crucial when designing how your collections/schemas should look in MongoDB, so make sure you know them before taking such decisions.
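
    For completeness, here is a minimal sketch of what the two-collection pattern can look like in the shell. The collection names (docs_latest, docs_history) and the sample values are illustrative, not taken from the linked posts:

    // History is keyed by a compound _id of docId and version,
    // so lookups by id + version are covered out of the box:
    db.docs_history.insertOne({
        "_id": { "docId": 174, "v": 4 },
        "attr1": 190,
        "attr2": "A-2"
    });

    // Latest is keyed by docId alone; the upsert replaces the previous latest:
    db.docs_latest.replaceOne(
        { "_id": 174 },
        { "v": 4, "attr1": 190, "attr2": "A-2" },
        { "upsert": true }
    );

    // Reads become plain finds on indexed _id values, no aggregation needed:
    db.docs_latest.find({ "_id": 174 });                          // latest version of doc 174
    db.docs_history.find({ "_id": { "docId": 174, "v": 2 } });    // one specific historical version
    // (field order matters when matching on an embedded _id document)

    The two writes are not atomic; how to stay consistent if something fails between them is discussed in the linked posts and is ignored in the sketch above.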