I'm new in MongoDB and I'm reading the manual. I've understood what shard and chunk are (other distributed systems have similar concepts) but I'm struggling to undestand these two lines:
The smallest range a chunk can represent is a single unique shard key value. A chunk that only contains documents with a single shard key value cannot be split.
This is the link of documentation: data partitioning. Given the example provided by the documentation with a minKey = 0 and maxKey = 200 can anyone give me an example of a chunk that can be splitted and one that cannot be splitted? Especially how documents inside a chunk that is not splittable look like? I think that if x is the shard key and the chunk relative to range 175-200 is the smallest, so unsplittable, a document with x=180 will be insert in that unsplittable chunk. I'm wrong? What will happen for other types of key?
Let's assume that you have a collection of tweets that is sharded. For simplicity I'm going to use 'account_id' as the shard key (e.g. x
in your question). Note that this is a bad shard key for this use case for reasons we'll see soon.
The collection is sharded and the range of accounts_id
's are broken up into chunks that will be distributed across the shards. One chunk will refer to the account_ids from 175-200.
After some time, each of these accounts continue tweeting and the size of this chunk grows to a point where it split into two chunks: [175, 183]
and [184,200]
.
Going further, assume that there is an incredibly prolific user (lets say account_id: 180
) in this range that tweets non-stop. Eventually chunk splits will occur to the point where the this account is in a chunk all by itself, e.g. [180,180]
. The size of this chunk will continue to grow as more and more tweets are added to the collection, but the chunk cannot be split as the shard key is at its finest granularity, which is a single account_id. There may be a large number of documents corresponding to this chunk, but there is no way to split this chunk by filtering only on the account_id.
This specific case is why this may be an inadvisable shard key.
In comparison, suppose the collection is sharded based on tweet_id
. This value would theoretically be unique, so there isn't the risk of a single value growing the size of a chunk to where it can not be split.