Tags: amazon-dynamodb, amazon-redshift

Cost of DynamoDB reads and writes for 8 million records per hour, plus COPY to Redshift


Can someone please help me check this calculation? Is this the right way to calculate the pricing?

We have a stream that produces an average of 8 million records per hour (roughly 2,500 records/second), and we want to insert them into DynamoDB using AWS Lambda.

Each record is at most 450 bytes in size. My understanding is that DynamoDB charges 1 write capacity unit per item up to 1 KB, so for 2,500 records/second we need 2,500 write capacity units.

Data gets inserted into DynamoDB in real time. For analysis purposes, we need the same data in Redshift as well, so every 15 minutes we need to sync the DynamoDB table to a Redshift table. The business wants to run analytics on the Redshift cluster only.

To copy the data from DynamoDB to Redshift with the Redshift COPY command within 15 minutes, I need approximately 10,000 read capacity units.

Doing this will cost approximately USD 2,400/month.
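For reference, here is how that estimate breaks down. This is a minimal sketch; the per-unit-hour prices are assumptions based on approximate us-east-1 provisioned-capacity pricing, so substitute the current rates for your region:

```python
# Back-of-the-envelope cost check for the provisioned capacity above.
# The unit prices below are assumptions (approximate us-east-1
# provisioned-capacity rates); substitute current pricing for your region.
WCU_PRICE_PER_HOUR = 0.00065   # USD per write capacity unit-hour (assumed)
RCU_PRICE_PER_HOUR = 0.00013   # USD per read capacity unit-hour (assumed)
HOURS_PER_MONTH = 730

wcu = 2_500    # 2,500 records/second, each under 1 KB -> 1 WCU per record
rcu = 10_000   # provisioned so the 15-minute export can be read quickly

write_cost = wcu * WCU_PRICE_PER_HOUR * HOURS_PER_MONTH   # ~1,186 USD/month
read_cost = rcu * RCU_PRICE_PER_HOUR * HOURS_PER_MONTH    # ~949 USD/month

print(f"Writes: ~{write_cost:,.0f} USD/month")
print(f"Reads:  ~{read_cost:,.0f} USD/month")
print(f"Total:  ~{write_cost + read_cost:,.0f} USD/month (before storage)")
```

With storage and data transfer added on top, the total lands in the neighbourhood of the USD 2,400/month figure above.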


More info

Existing architecture

S3 => Lambda => Redshift: This works fine up to Lambda, but on the Redshift side the jobs pile up in the queue and other jobs get delayed. We checked with AWS technical support, and they were clear that Redshift is not a good choice for real-time or near-real-time ingestion. Instead of Lambda we used an EMR cluster, created a manifest file (combining multiple files) and ran a Redshift COPY, but the result was still not satisfactory. The business does not want to grow the cluster right now.

Proposed architecture

S3 => Lambda => DynamoDB (as soon as a file lands) => every 15 minutes => query the data from DynamoDB and save it to S3 => Redshift
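A minimal sketch of the S3 => Lambda => DynamoDB leg, assuming each landed file is newline-delimited JSON and a hypothetical table named `stream_records` whose items already carry their key attributes:

```python
import json

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("stream_records")  # hypothetical name

def handler(event, context):
    """S3-triggered: write every record in the landed file to DynamoDB."""
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # batch_writer groups puts into BatchWriteItem calls (25 items each)
        # and retries unprocessed items automatically.
        with table.batch_writer() as batch:
            for line in body.decode("utf-8").splitlines():
                batch.put_item(Item=json.loads(line))  # assumes NDJSON input
```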

The ask from the business is: every 15 minutes, how much does it cost to query roughly 4 million records from DynamoDB, save them to S3, and run a Redshift COPY from that S3 data?

If I provision 10,000 read capacity units, can the read from DynamoDB (about 4.5 million records) complete in 7 minutes?

10,000 * 60 (seconds) = 600,000 records per minute

600,000 * 7 (minutes) = 4.2 million records

My calculation is:

7 minutes to read from DynamoDB

3 minutes to write to S3

3 minutes to run the Redshift COPY
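A quick sanity check of the read-time estimate (a minimal sketch; it assumes strongly consistent reads, where 1 RCU gives one read per second for items up to 4 KB, so eventually consistent reads would double the throughput):

```python
rcu = 10_000           # provisioned read capacity units
records = 4_200_000    # records accumulated in roughly 15 minutes

# 1 RCU = 1 strongly consistent read/second for items up to 4 KB;
# the 450-byte records here fit well within that limit.
seconds = records / rcu
print(f"{seconds / 60:.1f} minutes")   # -> 7.0 minutes
```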

In short, the business needs the data to be in DynamoDB as soon as a file lands in S3; they can wait 15 minutes for it to appear in Redshift.

Any other suggestions on this?


Solution

  • You appear to have two requirements:

    • Load data from Amazon S3 into DynamoDB
    • Load data from Amazon S3 into Amazon Redshift

    S3 to DynamoDB

    It appears that you have this working fine, using Amazon S3 to trigger an AWS Lambda function that then loads the data into DynamoDB.

    S3 to Redshift

    There is no reason to copy data from DynamoDB to Redshift if it is exactly the same data that was originally loaded from S3. Simply use the same files to load the data into Redshift, but do it as a batch at regular intervals.

    The best way to load data into Amazon Redshift is in bulk via the COPY command. Therefore, I would recommend:

    • Use a Lambda function (potentially the same one as used for DynamoDB) to collate a list of all incoming files in Amazon S3. You could store this list in a DynamoDB table (or somewhere else of your own choosing); see the first sketch below.
    • Use an Amazon CloudWatch Events schedule to trigger an AWS Lambda function at regular intervals (every 15 minutes), sketched at the end of this answer, which will:
      • Create a manifest file of all files received since the last load
      • Run a COPY command in Redshift, pointing to the manifest file
      • Mark those files in DynamoDB as having been processed (so that they will not be included in future executions)
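
    A minimal sketch of the first step, assuming a hypothetical tracking table named `incoming_files` with partition key `s3_key`:

    ```python
    import boto3

    tracking = boto3.resource("dynamodb").Table("incoming_files")  # hypothetical

    def handler(event, context):
        """S3-triggered: record each newly landed file for the next batch load."""
        for rec in event["Records"]:
            url = f"s3://{rec['s3']['bucket']['name']}/{rec['s3']['object']['key']}"
            tracking.put_item(Item={"s3_key": url, "processed": False})
    ```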

    Benefits of this method:

    • Separation of the two processes, which makes them easier to maintain
    • No need to read data from DynamoDB, which will reduce the Read Capacity Unit cost
    • You can vary the frequency of the load into Redshift as desired
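
    For reference, a sketch of the scheduled loader. It assumes the hypothetical `incoming_files` table from above, uses the Amazon Redshift Data API to run the COPY (bundling a PostgreSQL driver into the Lambda package is the classic alternative), and treats all bucket, cluster, database, and role names as placeholders:

    ```python
    import json

    import boto3
    from boto3.dynamodb.conditions import Attr

    tracking = boto3.resource("dynamodb").Table("incoming_files")  # hypothetical
    s3 = boto3.client("s3")
    redshift_data = boto3.client("redshift-data")

    MANIFEST_BUCKET = "my-etl-bucket"  # placeholder

    def handler(event, context):
        # 1. Collect the files not yet loaded into Redshift.
        #    (Scan pagination is omitted for brevity.)
        resp = tracking.scan(FilterExpression=Attr("processed").eq(False))
        keys = [item["s3_key"] for item in resp["Items"]]
        if not keys:
            return  # nothing landed since the last run

        # 2. Write a COPY manifest listing exactly those files.
        manifest = {"entries": [{"url": k, "mandatory": True} for k in keys]}
        manifest_key = f"manifests/{context.aws_request_id}.manifest"
        s3.put_object(Bucket=MANIFEST_BUCKET, Key=manifest_key,
                      Body=json.dumps(manifest))

        # 3. Run the COPY in Redshift, pointing at the manifest.
        redshift_data.execute_statement(
            ClusterIdentifier="my-cluster",  # placeholder
            Database="analytics",            # placeholder
            DbUser="loader",                 # placeholder
            Sql=(
                f"COPY events "
                f"FROM 's3://{MANIFEST_BUCKET}/{manifest_key}' "
                f"IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
                f"FORMAT AS JSON 'auto' MANIFEST;"
            ),
        )

        # 4. Flag the files as processed so the next run skips them.
        for key in keys:
            tracking.update_item(
                Key={"s3_key": key},
                UpdateExpression="SET processed = :t",
                ExpressionAttributeValues={":t": True},
            )
    ```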