Search code examples
nosqlamazon-dynamodbdynamodb-queriesnosql-aggregation

Why doesn't DynamoDB support simple aggregations?


  • Why doesn't DynamoDB support simple aggregations in the same way that mongoDB supports them?

  • From my admittedly limited knowledge of NoSQL databases, I would have thought it would be much more efficient to do aggregations in a NoSQL DB than in a traditional RDS because of the distributed nature of these databases.

  • Surely it's more efficient to perform the aggregations in parallel on small chunks of data with a master shard doing a final aggregation. Rather than on an RDS where one machine performs the operation on an entire dataset

  • I understand that you can still perform aggregations client side. But it feels much less efficient and for large datasets you may be sending very large amounts of unnecessary data back to your client

So why doesn't dynamoDB support simple aggregations? Am I missing something here? Are these operations actually much better suited to RDS's?

Edit

To be clear, I understand that table joins become much more problematic in distributed databases. But as far as I can see, the same problems do not apply to aggregations

  • For example, in the below query, we join a the users table with the emails table. This query is much more efficient if all the data is on one node because we don't need to start taking users from one node and checking whether we can join them to emails on a different node

       SELECT * FROM users u INNER JOIN emails e ON u.user_id = e.user_id 
    
  • However, in the below aggregation query, the data could be spread across hundreds of nodes. The query could be processed efficiently on all nodes and then the final result would essentially be the sum of the subqueries. Essentially this is the idea behind the map reduce paradigm

       SELECT AVG(age) FROM users GROUP BY nationality
    

Solution

  • To answer my own question -

    DynamoDB aggregations

    • AWS want you to use DynamoDB streams to continuously calculate the aggregation metrics that you want to use. link

    • Essentially, they want you to create a separate table for every aggregation metric that you need and then update that table each time a record is created/updated/deleted

    • They claim that this method scales much better than the mongoDB aggregation pipeline. I guess this makes sense as DynamoDB is designed for very high read and write throughput. If you have large volumes of reads each doing an aggregation things will start to get jammed up very quickly.

    • Also, DynamoDB is not designed for ad-hoc querying. So the fact that you have to plan the aggregations you are going to need and keep an up to date table of these aggregations is only forcing you to use DynamoDB in the way that it is designed to be used. i.e your data should be shaped in the way that it is going to be accessed. Very interesting talk by the DynamoDB gurus here explaining all of this

    Hive Aggregations

    • You can alternatively use Apache Hive to query DynamoDB. Hive allows you to use SQL like queries including aggregations. The aggregations will still take place on the client side. However, it uses a hadoop cluster to process these queries in small chunks accross multiple machines, which can make larger queries a lot faster. The only thing to watch for is the amount of read and write capacity you consume Integrating DynamDB with Apache Hive