Related to this question, I'm looking for a more specific answer. In an effort to keep this non-subjective, here is my full thought process for creating an activities table, with a sticking point that can be resolved with a quick example answer.
In an effort to better understand DynamoDB, I'm creating a personal website that contains an activity feed from a DynamoDB table. The goal is to evenly distribute partition keys while still being able to sort across all partition keys (I'm struggling with this part).
Different types of activities will include blog posts, projects, Twitter post references, LinkedIn post references, etc. Using the activity type as a partition key would not be wise, as my activity is heavily skewed: mostly Twitter posts, and hardly ever blog posts.
A unique activity id seems to be the best option for evenly distributing activities across DynamoDB partitions. However, this completely removes the ability to sort activities, as queries require a partition key value to be known first. This is where a global secondary index (GSI) will be helpful. With this, a sort key will not be required on the primary partition key, but can be paired with a partition key in the GSI.
This is the part where I'm stuck. What do I base the GSI partition key on? At the moment I'm thinking of a single value "activity" for all activities with a sort key of "date", but that is a single partition for all entries. Will a single GSI partition key value limit performance in this project?
Note that this is a small-scale project. However, I'm thinking about large-scale projects while building this one, attempting to create the best DynamoDB table possible with regard to partition distribution, while still keeping it flexible enough to sort all table records.
Treat a GSI (Global Secondary Index) the same as the main table's index when designing your schema. GSIs have their own read/write provisioned throughput limits and are subject to hot-partition throttling as well, and that throttling puts back pressure on the main table. In other words, if writes to your GSI get throttled, writes to your main table will start being throttled too.
Will a single GSI partition key value limit performance in this project?
A single partition key value for the complete table is definitely a misuse of DDB's ability to scale.
The goal is to evenly distribute partition keys while still being able to sort across all partition keys (I'm struggling with this part).
You can sort across partitions using a GSI, but you will again need a partition key for the GSI, and if that partition key is not distributed well enough you run into the problems I mentioned above.
DDB is powerful for put/get operations when modeled well, and for fairly simple queries with some filters. In general, you will utilize your throughput more efficiently as the ratio of partition key values accessed to the total number of partition key values in a table grows.
For your specific need, it's not directly possible to get a scalable solution from DDB, but there are still a few options:
Option 1:
We can model the data so that writes are fairly evenly distributed, at the cost of extra work when reading it back. This pattern is also known as Randomizing Across Multiple Partition Key Values. Since you don't need to access a specific item for a given time, this will work here.
The idea is to create a fixed set of suffixes (say 1 to 100), randomly pick a number from it to append to the creation date (not the timestamp) to form the partition key, and use the creation timestamp as the sort key.
This distributes your load across multiple random partition key values but increases read complexity, as you will need to query all of those partition key values and merge the results to get the final sorted view for that date.
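To make the write and read paths concrete, here is a minimal Python sketch of the idea. It is only an illustration under stated assumptions: `ActivityTable` is a hypothetical in-memory stand-in rather than a DynamoDB client, and the names `shard_key`, `put`, `query`, `read_day` and the `date.suffix` key format are made up for this example. In a real table, `query` would be one Query call per partition key value.

```python
import heapq
import random
from collections import defaultdict

SHARD_COUNT = 100  # size of the fixed suffix set (1 to 100), as described above


def shard_key(creation_date, shard):
    """Build a partition key such as '2024-05-01.17' (format is an assumption)."""
    return f"{creation_date}.{shard}"


class ActivityTable:
    """Hypothetical in-memory stand-in for a table keyed by (partition key, sort key)."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def put(self, creation_date, timestamp, payload):
        # Write path: a random suffix spreads one day's writes over many key values.
        shard = random.randint(1, SHARD_COUNT)
        self._partitions[shard_key(creation_date, shard)].append((timestamp, payload))

    def query(self, partition_key):
        # Stand-in for a Query on one partition key: items ordered by sort key.
        items = self._partitions.get(partition_key, [])
        return sorted(items, key=lambda item: item[0])

    def read_day(self, creation_date):
        # Read path: query every suffix for the date, then merge the sorted results.
        per_shard = [
            self.query(shard_key(creation_date, shard))
            for shard in range(1, SHARD_COUNT + 1)
        ]
        return list(heapq.merge(*per_shard, key=lambda item: item[0]))


table = ActivityTable()
table.put("2024-05-01", "2024-05-01T09:30:00Z", "tweet")
table.put("2024-05-01", "2024-05-01T08:00:00Z", "blog post")
table.put("2024-05-02", "2024-05-02T10:00:00Z", "project")
print(table.read_day("2024-05-01"))
```

Whichever suffixes the two writes for 2024-05-01 land on, `read_day` returns them ordered by timestamp, because each per-suffix result is already sorted and the merge only interleaves them. The cost is visible in `read_day`: one query per suffix for every date you want to display.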
Option 2:
Use multiple tables for hot and cold data, since this is time-series data. For info read
Option 3:
Scan? Not a good choice if we are talking about scalability and your data grows, but for a fairly small data set it does the job, so it is worth mentioning.
These are just examples; I'm not saying any of them is a good fit for your use case. So here is a thought exercise for you: write down all your use cases and access patterns. Figure out their importance, which ones are fine with eventual consistency and which are not, and check whether DDB is a good fit for them in the first place. Don't be tempted to pick DDB first and then struggle with access-pattern scalability.
Also read for more questions you should be asking yourself before restricting yourself to a specific access pattern in DDB.
Don't forget to read best practices: