Search code examples
mongodbsharding

Default range shard key mongodb


I have a mongodb shard with 2 shards (let say A & B), 17GB free space each. I set _id which contains object ID as shard key.

Below are commands used to set db and collection.

sh.enableSharding("testShard");
sh.shardCollection("testShard.shardedCollection", {_id:1});

Then I tried to fire 4,000,000 insert queries to mongos server. I execute script below 4 times.

for(var i=0; i<1000000; i++){
  db.shardedCollection.insert({x:i});
}

Using _id as shard key, as per my understanding, 4000000 document as mentioned will fit in 1 shard and all insert will happens in A shard only.

However, the result was not as I expected, it is ~1,3 million documents inserted in A shard, another ~2,7 million documents inserted in B shard.

Why did it happen? Are something missing in shard coll setting commands? Or my understanding is wrong, maybe there is something like default range shard key in mongodb?

It will be very helpful if someone can share the behavior of default range shard key (without tag aware).

Below is sh.status() result

  shard key: { "_id" : 1 }
  chunks:
    B  5
    A  5
  { "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("540c703398c7efdea6037cbc") } on : B Timestamp(6, 0) 
  { "_id" : ObjectId("540c703398c7efdea6037cbc") } -->> { "_id" : ObjectId("540c703498c7efdea603bfe3") } on : A Timestamp(6, 1) 
  { "_id" : ObjectId("540c703498c7efdea603bfe3") } -->> { "_id" : ObjectId("540c704398c7efdea605d818") } on : A Timestamp(3, 0) 
  { "_id" : ObjectId("540c704398c7efdea605d818") } -->> { "_id" : ObjectId("540c705298c7efdea607f04e") } on : A Timestamp(4, 0) 
  { "_id" : ObjectId("540c705298c7efdea607f04e") } -->> { "_id" : ObjectId("540c707098c7efdea60c20ba") } on : B Timestamp(5, 1) 
  { "_id" : ObjectId("540c707098c7efdea60c20ba") } -->> { "_id" : ObjectId("540c7144319c0dbee096f7d6") } on : B Timestamp(2, 4) 
  { "_id" : ObjectId("540c7144319c0dbee096f7d6") } -->> { "_id" : ObjectId("540c7183319c0dbee09f58ad") } on : B Timestamp(2, 6) 
  { "_id" : ObjectId("540c7183319c0dbee09f58ad") } -->> { "_id" : ObjectId("540eb15ddace5b39fbc32239") } on : B Timestamp(4, 2) 
  { "_id" : ObjectId("540eb15ddace5b39fbc32239") } -->> { "_id" : ObjectId("540eb192dace5b39fbca8a84") } on : A Timestamp(5, 2) 
  { "_id" : ObjectId("540eb192dace5b39fbca8a84") } -->> { "_id" : { "$maxKey" : 1 } } on : A Timestamp(5, 3) 

Solution

  • Yeah you are right it should have gone to a single shard. But while there is insertions going on a single shard, the balancer would be also balancing the shards and moving chunks to other shards.

    Having said that, what you should do is stop/disable the balancer by invoking the below command from your mongos.

    http://docs.mongodb.org/manual/reference/method/sh.disableBalancing/#sh.disableBalancing

    sh.disableBalancing(namespace)
    //namespace     string  The namespace of the collection.
    

    Once done, kick off your inserts and see where all the inserts are heading to.

    For _id field sharding you could also look here:

    http://docs.mongodb.org/manual/faq/sharding/#can-you-shard-on-the-id-field

    Be aware that ObjectId() values, which are the default value of the _id field, 
    increment as a timestamp. As a result, when used as a shard key, all new documents
    inserted into the collection will initially belong to the same chunk on a single 
    shard. Although the system will eventually divide this chunk and migrate its contents 
    to distribute data more evenly, at any moment the cluster can only direct insert 
    operations at a single shard. This can limit the throughput of inserts. If most of 
    your write operations are updates, this limitation should not impact your performance. 
    However, if you have a high insert volume, this may be a limitation.