Search code examples
databasemongodbperformanceindexingsharding

MongoDb - Duplicate index fields for sharding purposes?


I want to build a cluster to store logging data. Each document has several fields but these are the key ones:

  • user_id (medium cardinality)
  • identifier (this has an extremely high cardinality but is not guaranteed to be unique across users, it can be a UPC for example)
  • channel (low cardinality)
  • timestamp

The collection is expected to have over 1 billion documents so sharding and performance are important here.

Now nearly all high-frequency queries to the collection will have user_id in them because the logs are displayed in the UI to each user uniquely. Most queries will be on user_id + identifier. Some queries will be time bound. Some queries also use channel but not all. user_id is a monotonically increasing field.

I want to shard on hashed(user_id). One ideal index is {"user_id": 1, "identifier": 1, "timestamp": 1} so I made that. I tried sharding on hashed(user_id) but it did not work in this case and I realized user_id must be of the same type. However, making an index of {"user_id": "hashed", "identifier": 1, "timestamp": 1} is also impossible since compound keys with a hash are disallowed.

What is my best option here?

  • create one index with just hashed(user_id) so I can shard on it and then another index with {"user_id": 1, "identifier": 1, "timestamp": 1}? I would incur a storage penalty here.
  • don't hash the user_id even if it's monotonically increasing and instead shard on {"user_id": 1, "identifier": 1}? I'm not sure if there are disadvantages here compared to simply sharding on hashed(user_id)
  • some other option?

Solution

  • Note that MongoDB 4.4 allows for compound indexes with a single hashed field: https://docs.mongodb.com/manual/core/hashed-sharding/

    If you can't easily upgrade to 4.4, given that storage pressure is high here with the large number of documents and that most queries will contain both user_id and identifier, sharding on {"user_id": 1, "identifier": 1} sounds like the best option you have here. It will allow these queries to be fast at the expense of your other queries that need to search across all identifiers per user or time-based queries.

    I'm not sure of a better solution on a version less than MongoDB 4.4.