Search code examples
mongodbbigdatasharding

Need help to select sharding key in MongoDB


For my application I need to shard a fairly big collection, the entire collection will contain app. 500 billion documents.

I have two potential fields which can be used as Sharding Key:

For inserting either Sharding Key will distribute documents evenly throughout the cluster, there is does not matter which field I use as Sharding Key.

For query it is different.

  • Field(1) is usually part of the query filter condition, thus query would be processed usually on a single shard only.

  • Field(2) is typically not part of the query filter condition, thus query would be processed over all shards and typically several shards will contribute to final query result.

Which one is the better field to be used as Sharding Key? I did not find anything in MongoDB documentation about that topic.

Either fields have the same range and very similar cardinality figures, there won't be any difference. Usually the number of documents returned by a query is very low (typically less than 20-30 documents).


Solution

  • In a sharded cluster the mongos router determines which shard is to be targeted for a read or write operation - based on the available shard key meta-data stored on the config servers.

    For inserting either Sharding Key will distribute documents evenly throughout the cluster, there is does not matter which field I use as Sharding Key.

    When you insert a document it will have a shard key and the document will be stored on a designated shard.

    Field(1) is usually part of the query filter condition, thus query would be processed usually on a single shard only.

    The shard key's main purposes are (a) to distribute data evenly across shards in a cluster, and (b) to be able to query the data in such a way that the query targets a single shard.

    For a query to target a single shard, the shard key must be part of the query's filter criteria. The mongos router will target the single shard using the shard key.

    If the shard key is not part of the filter criteria it will be a scatter-gather operation (a long running query). It is important that the most important query operations of the application using the sharded collection must be able use the shard key.

    Field(2) is typically not part of the query filter condition, thus query would be processed over all shards and typically several shards will contribute to final query result.

    When the shard key is not part of the query filter, the operation will span across multiple shards (a scatter-gather operation) and it will be a slow running operation. The mongos router will not be able to determine which shards have the target data, and all the shards in the cluster will be queried to return the final result.

    Which one is the better field to be used as Sharding Key?

    It can be concluded that the Field(1) must be used as a shard key.

    See documentation on shard keys and choosing a shard key @ MongoDB docs on Shard Keys.