In MongoDB, there is a 32-bit int type (4 bytes) and a 96-bit ObjectId type (12 bytes). I have noticed that an index on a 32-bit int field is bigger than one on an ObjectId field, whereas I was expecting the opposite according to this question: Are there any tools to estimate index size in MongoDB?
Is this specific to ObjectId, and how is it possible?
Here are some stats showing the difference, using MongoDB 3.2.9 and mongodb-java-driver 3.2 with the default configuration (WiredTiger engine + snappy compression):
"_id" as ObjectId:
> db.objectId.stats()
{
    "ns" : "test1.objectId",
    "count" : 500000,
    "size" : 20500000,
    "avgObjSize" : 41,
    "storageSize" : 6737920,
    [...]
    "nindexes" : 1,
    "totalIndexSize" : 4300800,
    "indexSizes" : {
        "_id_" : 4300800
    }
}
"_id" as int32 (linear insert):
> db.int32linear.stats()
{
    "ns" : "test1.int32linear",
    "count" : 500000,
    "size" : 16500000,
    "avgObjSize" : 33,
    "storageSize" : 5586944,
    [...]
    "nindexes" : 1,
    "totalIndexSize" : 5255168,
    "indexSizes" : {
        "_id_" : 5255168
    }
}
"_id" as int32 (random insert):
> db.int32random.stats()
{
    "ns" : "test1.int32random",
    "count" : 500000,
    "size" : 16500000,
    "avgObjSize" : 33,
    "storageSize" : 5595136,
    [...]
    "nindexes" : 1,
    "totalIndexSize" : 5378048,
    "indexSizes" : {
        "_id_" : 5378048
    }
}
Here is the code to reproduce the test:
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.bson.Document;

public class Main {
    public static void main(String[] args) {
        List<Document> listDoc = new ArrayList<>();
        MongoClient mongoClient = new MongoClient();
        MongoDatabase db = mongoClient.getDatabase("test1");
        MongoCollection<Document> objectId = db.getCollection("objectId");
        MongoCollection<Document> int32linear = db.getCollection("int32linear");
        MongoCollection<Document> int32random = db.getCollection("int32random");

        // No "_id" given: the driver generates an ObjectId for each document
        for (int i = 0; i < 500000; i++) {
            listDoc.add(new Document("field", "content"));
        }
        objectId.insertMany(listDoc);

        listDoc.clear();
        // "_id" as an int32, inserted in ascending order
        for (int i = 0; i < 500000; i++) {
            listDoc.add(new Document("_id", i).append("field", "content"));
        }
        int32linear.insertMany(listDoc);

        // Same documents, inserted in random order
        Collections.shuffle(listDoc);
        int32random.insertMany(listDoc);

        mongoClient.close();
    }
}
I am not certain, but: WiredTiger is effectively compressing the ObjectId keys. If you look at how they are generated, and if all the docs are inserted very quickly (within a few seconds) on a single machine, there will be a very long common prefix to the ObjectIds. WiredTiger's key prefix compression will then be quite effective.
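A rough illustration of that common prefix (a sketch only: the machine/pid bytes below are made up, and the 3.2-era ObjectId layout of 4-byte timestamp + 3-byte machine id + 2-byte process id + 3-byte counter is assumed):

```java
import java.nio.ByteBuffer;

public class ObjectIdPrefix {
    // Build a fake 12-byte ObjectId; real drivers derive the machine
    // and process bytes from the host, these values are placeholders.
    static byte[] fakeObjectId(int timestamp, int counter) {
        ByteBuffer buf = ByteBuffer.allocate(12);   // big-endian by default
        buf.putInt(timestamp);                      // 4-byte timestamp
        buf.put(new byte[] {0x1A, 0x2B, 0x3C});     // 3-byte machine id (assumed)
        buf.putShort((short) 0x4D5E);               // 2-byte process id (assumed)
        buf.put((byte) (counter >> 16));            // 3-byte counter, big-endian
        buf.put((byte) (counter >> 8));
        buf.put((byte) counter);
        return buf.array();
    }

    // Number of leading bytes two keys share
    static int commonPrefix(byte[] a, byte[] b) {
        int i = 0;
        while (i < a.length && i < b.length && a[i] == b[i]) i++;
        return i;
    }

    public static void main(String[] args) {
        int now = 1474000000;  // two inserts within the same second
        byte[] id1 = fakeObjectId(now, 1000);
        byte[] id2 = fakeObjectId(now, 1001);
        System.out.println(commonPrefix(id1, id2)); // prints 11 (of 12 bytes)
    }
}
```

Consecutive ObjectIds from one process in one second share everything but the tail of the counter, which is exactly the shape prefix compression likes.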
So why does this not work with incrementing integers? Because of the little-endian format: BSON stores int32 values little-endian, so the fastest-changing byte comes first and consecutive keys share no leading bytes for the compressor to exploit.
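A minimal sketch of the little-endian point (pure Java, no driver needed):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndianKeys {
    // Encode an int the way BSON stores int32: little-endian.
    static byte[] le(int v) {
        return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(v).array();
    }

    // Number of leading bytes two keys share
    static int commonPrefix(byte[] a, byte[] b) {
        int i = 0;
        while (i < a.length && i < b.length && a[i] == b[i]) i++;
        return i;
    }

    public static void main(String[] args) {
        // 1000 -> E8 03 00 00, 1001 -> E9 03 00 00: the low-order byte
        // comes first and changes on every insert, so consecutive keys
        // have no common prefix at all.
        System.out.println(commonPrefix(le(1000), le(1001))); // prints 0
    }
}
```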
If the above hypothesis is correct, then in practice on a real system, where inserts are spread over more time and across many servers (sharded), the ObjectId index would likely be a bit bigger than the int index -- but still pretty reasonable in size. If you want to check this, try turning off prefix compression when building the index.
Overall I think this is good news: the issue isn't that the int index is big, but that the ObjectId index is efficient -- ~10 bytes/entry is reasonable (although I could imagine doing better), given there is a record-id in addition to the key for each document.
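The per-entry figures can be checked directly against the stats above:

```java
public class BytesPerEntry {
    // Average index bytes per document, from totalIndexSize and count
    static double perEntry(long totalIndexSize, long count) {
        return (double) totalIndexSize / count;
    }

    public static void main(String[] args) {
        long count = 500000;
        System.out.println(perEntry(4300800, count)); // ObjectId:     ~8.6 bytes/entry
        System.out.println(perEntry(5255168, count)); // int32 linear: ~10.5 bytes/entry
        System.out.println(perEntry(5378048, count)); // int32 random: ~10.8 bytes/entry
    }
}
```

So the compressed ObjectId index actually costs fewer bytes per entry than its uncompressed 12-byte key would suggest, while the 4-byte int keys compress barely at all.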
https://docs.mongodb.com/manual/reference/method/ObjectId/
p.s. I believe the incrementing int index is a little smaller than the random one because there is a modest optimization in mmap for ascending keys.