hadoop, amazon-web-services, amazon-s3, amazon-redshift, elastic-map-reduce

Best technology stack for aggregation across various properties


We are developing a platform that models the flow of entities across a graph. The system has to answer questions such as: how many entities with a given set of properties are sitting at a given node of the graph, what is the inflow on a node, and what is the outflow on a node. Flow data is fed to the system as a stream. We are thinking of breaking the flow data into time buckets (say, 5 minutes), pre-computing various aggregates against different properties, and storing the aggregates in DynamoDB to serve queries.
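To make the bucketing idea concrete, here is a minimal Python sketch of rounding events into 5-minute buckets and counting inflow and outflow per node and property. The `FlowEvent` shape, its field names, and the sample data are assumptions made for illustration; the real event schema is not given in the question.

```python
from collections import Counter
from typing import Iterable, NamedTuple

BUCKET_SECONDS = 300  # 5-minute buckets, as suggested in the question


class FlowEvent(NamedTuple):
    """Hypothetical event shape: one entity moving from one node to another."""
    timestamp: int  # Unix epoch seconds
    src: str        # node the entity leaves
    dst: str        # node the entity enters
    prop: str       # one property value of the entity, e.g. "color=red"


def bucket_start(timestamp: int, width: int = BUCKET_SECONDS) -> int:
    """Round a timestamp down to the start of its time bucket."""
    return timestamp - (timestamp % width)


def aggregate(events: Iterable[FlowEvent]) -> Counter:
    """Count events per (bucket, node, property, direction)."""
    counts: Counter = Counter()
    for e in events:
        b = bucket_start(e.timestamp)
        counts[(b, e.src, e.prop, "out")] += 1  # outflow from the source node
        counts[(b, e.dst, e.prop, "in")] += 1   # inflow to the destination node
    return counts


# Made-up sample data for illustration only.
events = [
    FlowEvent(1000, "A", "B", "color=red"),
    FlowEvent(1100, "A", "B", "color=red"),
    FlowEvent(1300, "B", "C", "color=red"),
]
totals = aggregate(events)
```

Each key of `totals` could be stored as one pre-computed aggregate row. Under this model, the number of entities sitting at a node at a given time would be its cumulative inflow minus its cumulative outflow up to that bucket.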

With regard to this, we are evaluating the following options:

  • EMR: Put the flow data in AWS S3/DynamoDB and run a MapReduce/Hive job.

  • RDS: Put recent data into AWS RDS and compute the aggregates via SQL.

  • Akka: A framework for building distributed applications via actors and message passing.

    If anyone has worked on a similar use case or has used any of the above technologies, please let me know which approach would be the best fit for our use case.


Solution

  • The final solution used AWS Redshift. The driving reason was the requirement for high-speed data ingestion, which Redshift provides via the COPY command.

    Hadoop is built to store data efficiently, but it does not guarantee a sub-second SLA for ingestion, nor does it provide an SLA for when the data will be available to MapReduce jobs. This was the main reason we did not go with EMR, or Hadoop in general.
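For readers unfamiliar with COPY, the following Python sketch builds a Redshift COPY statement that bulk-loads gzipped CSV files from S3. The table name, S3 prefix, and IAM role ARN are placeholders, not values from the original setup, and the sketch only builds the SQL text; running it would require a database connection, which is left out here.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement that loads gzipped CSV files from S3.

    All arguments are assumed to be trusted values; this sketch does no escaping.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "GZIP;"
    )


# Placeholder names for illustration only.
statement = build_copy_statement(
    table="flow_events",
    s3_prefix="s3://example-bucket/flow/2015-01-01/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftLoadRole",
)
print(statement)
```

COPY loads all files under the given prefix in parallel across the cluster, so writing each time bucket of flow data to its own S3 prefix is one plausible way to feed it.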