Search code examples
mongodbsharding

MongoDB compound shard key strategy


I have documents like: {_id: "someid1", "bar": "somevaluebar1"} {_id: "someid2", "foo": "somevaluefoo2", "bar": "somevaluebar2"} {_id: "someid3", "foo": "somevaluefoo3", "zoo": "somevaluezoo3"} {_id: "someid4", "zoo": "somevaluezoo4"}

1. If we query documents by "foo" the most and "bar" the second, does it make sense to create a compound shard key like { "foo" : 1, "bar" : 1, "_id" : 1 }?

2. "foo" or "bar" can be missing from the document too so I added "_id" to the compound shard key. Is this a good decision?

3. What will happen if I query by "bar"? Does it hit all the shards to gather the result?


Solution

  • Note: Some issues with the answer, too old, can't debug.

    If required, can be removed as a correct answer, not sure about the Stackoverflow policies.

    It is important to understand while creating shard keys, that they need to have different property compared to regular keys (primary or secondary). Usually, shards group data chunks containing similar shard key values (i.e. values residing in a particular range.). So, a good shard key should not be monotonically increasing.

    In case of compound shard key, since it includes "_id" field it would be unique for each document. Therefore, it makes a bad choice for shard key as all the chunks would be dumped on one shard. This can be overcome by using hashed keys as follows,

    sh.shardCollection("<your-db>", {{ "foo" : 1, "bar" : 1, "_id" : 1 }:"hashed"})

    Now, addressing the individual questions you raised. 1. Since you query your data more on "foo" than "bar", it makes sense to make "foo" as the shard key. In case, "foo" is uniformly distributed it need not be hashed.

    1. You can edit the "foo" field with some dummy value, then you would not need to ad "_id" in the compound shard key. However, if the number of empty "foo" field is high, you can use a range of dummy values.

    2. If you only create shards based on "foo", when you query using "bar" all the shards would be hit to gather result.

    tl;dr. If frequency of operations on "foo" is significantly higher that "bar" based operations, and "foo" is evenly distributed accorss all the rows then, foo can be used as a shard key, and dummy value(s) can be used to fill the missing "foo" values.