I want to build a cluster to store logging data. Each document has several fields but these are the key ones:
The collection is expected to have over 1 billion documents so sharding and performance are important here.
Now nearly all high-frequency queries to the collection will have user_id
in them because the logs are displayed in the UI to each user uniquely. Most queries will be on user_id
+ identifier
. Some queries will be time bound. Some queries also use channel
but not all. user_id
is a monotonically increasing field.
I want to shard on hashed(user_id)
. One ideal index is {"user_id": 1, "identifier": 1, "timestamp": 1}
so I made that. I tried sharding on hashed(user_id)
but it did not work in this case and I realized user_id
must be of the same type. However, making an index of {"user_id": "hashed", "identifier": 1, "timestamp": 1}
is also impossible since compound keys with a hash are disallowed.
What is my best option here?
hashed(user_id)
so I can shard on it and then another index with {"user_id": 1, "identifier": 1, "timestamp": 1}
? I would incur a storage penalty here.user_id
even if it's monotonically increasing and instead shard on {"user_id": 1, "identifier": 1}
? I'm not sure if there are disadvantages here compared to simply sharding on hashed(user_id)
Note that MongoDB 4.4 allows for compound indexes with a single hashed field: https://docs.mongodb.com/manual/core/hashed-sharding/
If you can't easily upgrade to 4.4, given that storage pressure is high here with the large number of documents and that most queries will contain both user_id
and identifier
, sharding on {"user_id": 1, "identifier": 1}
sounds like the best option you have here. It will allow these queries to be fast at the expense of your other queries that need to search across all identifiers per user or time-based queries.
I'm not sure of a better solution on a version less than MongoDB 4.4.