Why doesn't DynamoDB support simple aggregations in the same way that MongoDB supports them?
From my admittedly limited knowledge of NoSQL databases, I would have thought aggregations would be much more efficient in a NoSQL DB than in a traditional RDBMS, because of the distributed nature of these databases.
Surely it's more efficient to perform the aggregation in parallel on small chunks of data, with a master shard doing a final combine, than on an RDBMS where one machine performs the operation over the entire dataset.
I understand that you can still perform aggregations client side, but that feels much less efficient, and for large datasets you may be sending very large amounts of unnecessary data back to your client.
So why doesn't DynamoDB support simple aggregations? Am I missing something here? Are these operations actually much better suited to RDBMSs?
To be clear, I understand that table joins become much more problematic in distributed databases. But as far as I can see, the same problems do not apply to aggregations
For example, in the query below, we join the users table with the emails table. This query is much more efficient if all the data is on one node, because we don't need to take users from one node and check whether we can join them to emails on a different node.
SELECT * FROM users u INNER JOIN emails e ON u.user_id = e.user_id
However, in the aggregation query below, the data could be spread across hundreds of nodes. Each node could process its own chunk efficiently, and the final result would be a cheap combination of the partial results (for AVG, combining the per-node sums and counts). This is essentially the idea behind the map-reduce paradigm.
SELECT AVG(age) FROM users GROUP BY nationality
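To make the map-reduce intuition concrete, here is a minimal sketch in Python. The shard layout and field names are made up for illustration: each "node" computes a partial (sum, count) per nationality, and a coordinator merges those partials to finish the average.

```python
from collections import defaultdict

# Hypothetical user records spread across three "nodes" (shards).
shards = [
    [{"nationality": "FR", "age": 30}, {"nationality": "DE", "age": 40}],
    [{"nationality": "FR", "age": 50}, {"nationality": "DE", "age": 20}],
    [{"nationality": "FR", "age": 40}],
]

def map_shard(records):
    """Map step: each node computes a partial (sum, count) per nationality."""
    partial = defaultdict(lambda: [0, 0])
    for r in records:
        partial[r["nationality"]][0] += r["age"]
        partial[r["nationality"]][1] += 1
    return partial

def reduce_partials(partials):
    """Reduce step: the coordinator merges partials and finishes the average."""
    totals = defaultdict(lambda: [0, 0])
    for p in partials:
        for nat, (s, c) in p.items():
            totals[nat][0] += s
            totals[nat][1] += c
    return {nat: s / c for nat, (s, c) in totals.items()}

averages = reduce_partials(map_shard(s) for s in shards)
# averages is {"FR": 40.0, "DE": 30.0}
```

Note that the reduce step must combine sums and counts rather than averaging the per-node averages, which would give the wrong answer when shards hold different numbers of rows.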
To answer my own question -
AWS want you to use DynamoDB streams to continuously calculate the aggregation metrics that you want to use. link
Essentially, they want you to create a separate table for every aggregation metric that you need and then update that table each time a record is created/updated/deleted
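As a rough sketch of what such a stream handler might look like (the table layout, attribute names, and the running sum/count scheme are my own assumptions, not an official AWS example): keep an `age_sum` and `user_count` per nationality in the aggregate table, apply an atomic ADD for each stream record, and compute AVG as `age_sum / user_count` at read time. The translation from stream record to update arguments is kept as a pure function so it can be tested without AWS.

```python
def stats_update_for(record):
    """Translate one DynamoDB stream record into update_item() arguments.

    Hypothetical aggregate table keyed on nationality, with running
    age_sum and user_count attributes. Returns the kwargs you would
    pass to a boto3 Table.update_item() call.
    """
    event = record["eventName"]
    if event == "MODIFY":
        # Only the age changed: apply the delta to the running sum.
        new = int(record["dynamodb"]["NewImage"]["age"]["N"])
        old = int(record["dynamodb"]["OldImage"]["age"]["N"])
        nat = record["dynamodb"]["NewImage"]["nationality"]["S"]
        return {
            "Key": {"nationality": nat},
            "UpdateExpression": "ADD age_sum :s",
            "ExpressionAttributeValues": {":s": new - old},
        }
    # INSERT adds the row's contribution; REMOVE subtracts it.
    image = record["dynamodb"]["NewImage" if event == "INSERT" else "OldImage"]
    sign = 1 if event == "INSERT" else -1
    return {
        "Key": {"nationality": image["nationality"]["S"]},
        "UpdateExpression": "ADD age_sum :s, user_count :c",
        "ExpressionAttributeValues": {
            ":s": sign * int(image["age"]["N"]),
            ":c": sign,
        },
    }
```

In a real Lambda you would loop over `event["Records"]` and pass each result to `table.update_item(**stats_update_for(record))`; because ADD is atomic, concurrent updates from different stream shards don't clobber each other.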
They claim that this method scales much better than the MongoDB aggregation pipeline. I guess this makes sense, as DynamoDB is designed for very high read and write throughput. If you have large volumes of reads each performing an aggregation, things will start to get jammed up very quickly.
Also, DynamoDB is not designed for ad-hoc querying. So the fact that you have to plan the aggregations you are going to need, and keep an up-to-date table of those aggregations, is only forcing you to use DynamoDB in the way it is designed to be used, i.e. your data should be shaped in the way it is going to be accessed. Very interesting talk by the DynamoDB gurus here explaining all of this.