
Query on large collections using the MongoDB C++ driver differs from shell


Windows 7 64 SP1 -- MongoDB 2.2.0-rc2 -- Boost 1.42 -- MS VS 2010 Ultimate -- C++ driver

Following "MongoDB in Action", in the shell:

for(i=0; i<200000; i++){
  db.numbers.save({num: i});
}

db.numbers.find() displays:

{ "_id": ObjectId("4bfbf132dba1aa7c30ac830a"),"num" : 0 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830b"),"num" : 1 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830c"),"num" : 2 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830d"),"num" : 3 }
...

So, replicating in C++:

// Insert 200,000 documents
for ( int i = 0; i < 200000 ; i++)
  c.insert(dc,BSON(GENOID << "num" << i));

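// Display the first 20 documents (an empty query matches every document)
Query qu = BSONObj();
auto_ptr<DBClientCursor> cursor = c.query(dc,qu);
// Check more() before next() so a short result set cannot run past the end
for ( int i = 0 ; i < 20 && cursor->more() ; i++){
  cout << cursor->next().toString() << endl;
}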

The output:

{ "_id" : ObjectId("504bab737ed339cef0e26829"), "num" : 199924 }
{ "_id" : ObjectId("504bab737ed339cef0e2682a"), "num" : 199925 }
{ "_id" : ObjectId("504bab737ed339cef0e2682b"), "num" : 199926 }
{ "_id" : ObjectId("504bab737ed339cef0e2682c"), "num" : 199927 }
....

Invoking db.numbers.find() in the shell has the same output. Why isn't it starting with {"num" : 0}? It exists:

> db.numbers.find({"num" : 0})
{ "_id" : ObjectId("504bab417ed339cef0df5b35"), "num" : 0 }

The _id for {"num" : 0} sorts before the _id for {"num" : 199924}.

And an index on "_id" exists:

> db.numbers.getIndexes()
[
    {
            "v" : 1,
            "key" : {
                    "_id" : 1
            },
            "ns" : "learning.numbers",
            "name" : "_id_"
    }
]

If I add a sort on _id by changing the query code to read:

auto_ptr<DBClientCursor> cursor = c.query(dc,qu.sort("_id")); 

then it prints in order:

{ "_id": ObjectId("4bfbf132dba1aa7c30ac830a"),"num" : 0 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830b"),"num" : 1 }
...

This doesn't happen with a smaller collection (say 200) of documents.

The question: Why does it appear that the C++ query isn't using the collection's index on _id? Or what else explains this apparent anomaly (or my lack of understanding)?


Solution

  • Indexing and sorting are distinct concepts. You can find data in an index without sorting the results; you can also sort results without using an index (though this isn't recommended).

    Since you have not specified a sort order for your find(), the results are returned in natural order, that is, the order in which the documents are found on disk. For a collection where you have only inserted documents (and never deleted or updated them), the natural order should approximate insertion order, but that is only guaranteed for a capped collection, which is maintained in insertion order.

    Once you start deleting documents or updating them (which may cause them to be moved), free-space "gaps" are created in MongoDB's preallocated data files. MongoDB reuses that free space for new document insertions and moves, so over time the natural order no longer matches the insertion order.

    If you expect results in a specific order, you have to specify that sort in your query, as the sort("_id") call in the question does.
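
To make the free-space reuse concrete, here is a rough toy model in plain C++. It is only a sketch under loud assumptions: ToyCollection, its slot array, and its free list are hypothetical names invented for illustration, and they are not how MongoDB actually lays out its data files or what the C++ driver does. The model only shows why an unsorted scan can stop matching insertion order once gaps get reused, while an explicit sort still returns a predictable order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only (hypothetical, NOT MongoDB's real storage format):
// documents live in an array of slots, and a deleted slot is reused
// by the next insert.
class ToyCollection {
public:
    // Insert a document, filling the most recently freed slot if any exists.
    void insert(int num) {
        if (!free_slots_.empty()) {
            std::size_t slot = free_slots_.back();
            free_slots_.pop_back();
            assert(slot < slots_.size());
            slots_[slot].used = true;
            slots_[slot].num = num;
        } else {
            Slot s;
            s.used = true;
            s.num = num;
            slots_.push_back(s);
        }
    }

    // Delete the first document holding `num`, leaving a gap behind.
    // Returns true if a document was removed.
    bool remove(int num) {
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            if (slots_[i].used && slots_[i].num == num) {
                slots_[i].used = false;
                free_slots_.push_back(i);
                return true;
            }
        }
        return false;
    }

    // "Natural order": walk the slots in storage order, skipping gaps.
    std::vector<int> find() const {
        std::vector<int> out;
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            if (slots_[i].used) out.push_back(slots_[i].num);
        }
        return out;
    }

    // Explicit sort: the order no longer depends on where documents are stored.
    std::vector<int> find_sorted() const {
        std::vector<int> out = find();
        std::sort(out.begin(), out.end());
        return out;
    }

private:
    struct Slot {
        bool used;
        int num;
    };
    std::vector<Slot> slots_;
    std::vector<std::size_t> free_slots_;
};
```

In this model, inserting 0 through 4, removing 1, and then inserting 5 makes find() return 0, 5, 2, 3, 4, because the new document lands in the gap left by the deleted one. find_sorted() returns 0, 2, 3, 4, 5 regardless of where each document ended up.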