Search code examples
mongodbsharding

Defining a shard-key automatically for a dynamic collection and suggestion on the design


I want to implement Sharding for my MongoDb and need some of your suggestions.

Insight

  1. We have lots of cron-job which collects various information about a machine & writes them to it's own collection.
  2. Collections are created dynamically.
  3. Each collection has millions of data.
  4. Structure1 for each collection is Name, Category, Subcategory, NodeId, Process-Start-Time, Process-End-Time, Value.
  5. Structure2 for each collection is Name, Category, Subcategory, Subtype, Date, Value.
  6. Structure3 for each collection is Name, Category, Subcategory, NodeId, Process-Start-Time, Process-End-Time, Value, Flag1, Flag2, Flag3.

After a research we found we will use sharding and make it useful with multiple servers which guarantees two things:

  1. Need not worry on running out of space.
  2. Balanced performance across servers

Question 1: My problem is to find a correct shard-key to partition the data. I don't see a unique-key in the collection other than the default ObjectId. After further reading I have found that it is possible to use a composite key, does it make sense to have a composite key or custom ObjectId as a key where the value might look like ObjectId: _. This is very key with respect to performance of returning the results of a query & moving the chunks.

Question 2: Since we have large collections, it will become difficult to set the shard each time in Mongo console when a collection is created dynamically. Is there any way to make it automatic in mongo so that whenever a collection is created for a shard-database, it will define the shard-key for that collection?

Question 3: Is it really necessary to pass shard-key to the query expression? I don't think we have used ObjectId in any of our query-expression, I doubt I can come with a unique ID due to fact that the data is not structured like a traditional DB. If yes, how is it going to help for a query like this:

Example:

{ category: "Energy", subcategory: "Watt", Process-Start-Time: {$gte: 132234234}}

Thanks in advance for stepping in and helping me fix this problem.


Solution

  • The easiest way to do this might be to shard the database, but leave the collections unsharded. Benefits:

    • Collections will be distributed across the shards (but each collection will only live on one shard). EDIT: I was wrong about this, this isn't implemented yet. See the related Jira ticket to track. For now, you can use tags to distribute collections, but not automatically.
    • No need to call shardCollection on each new collection

    The downside is that all traffic for a collection will go to its shard, which might be impractical for what you're trying to do.

    As far as your questions:

    Question 1: Shard key does not have to be unique. What are you generally querying for? You might be better of with something like {category:1} or {category:1,subcategory:1}.

    Question 2: No built-in way to do it automatically, the best way to get that behavior is probably to set up a cron job.

    Question 3: No. Queries containing the shard key can be sent to specific shards and queries without the shard key must be sent to all shards, see http://www.mongodb.org/display/DOCS/Sharding+Introduction#ShardingIntroduction-OperationTypes.