We have a requirement to log an event in a DynamoDB table whenever an ad is served to the end user. The table receives more than 250 writes per second.
We want to aggregate this data and move it to Redshift for analytics.
I assume the DynamoDB stream will emit a record for every insert into the table. How can I collect the stream records into batches and then process those batches? Are there any best practices for this kind of use case?
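For context, "batching the stream" could look like the following minimal, framework-free sketch: it buffers incoming stream records and flushes an aggregated per-ad count once a batch size is reached. The record shape (an insert's `NewImage` carrying an `ad_id` string attribute) and the batch size are assumptions for illustration.

```python
from collections import Counter

class AdEventBatcher:
    """Buffers DynamoDB stream records and flushes aggregated counts in batches."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffer = []
        self.flushed = []  # aggregated batches, ready to be loaded into Redshift

    def add(self, record):
        # A stream record for an INSERT carries the new item under NewImage.
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Aggregate the batch: count how many times each ad was served.
        counts = Counter(
            r["dynamodb"]["NewImage"]["ad_id"]["S"] for r in self.buffer
        )
        self.flushed.append(dict(counts))
        self.buffer = []
```

In practice you would also flush on a timer, not only on batch size, so that a quiet period does not delay the last partial batch indefinitely.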
I was reading about Apache Spark, and it seems Spark can do this kind of aggregation. However, Spark Streaming does not read DynamoDB streams directly.
Any help or pointers are appreciated.
Thanks
DynamoDB streams have two interfaces: the low-level API and the Kinesis Adapter. Apache Spark has a Kinesis integration, so you can use them together. If you are wondering which interface to use, AWS recommends the Kinesis Adapter.
Here is how to use the Kinesis Adapter with DynamoDB.
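For comparison, the low-level Streams API mentioned above follows a describe-stream / get-shard-iterator / get-records loop. A hedged sketch (the method names and parameters match the boto3 `dynamodbstreams` client; the client is injected so the batching logic itself stays framework-free, and a real reader would additionally follow `NextShardIterator` rather than doing a single pass):

```python
def read_stream_batch(streams_client, stream_arn, limit=100):
    """Reads up to `limit` records per shard from a DynamoDB stream using the
    low-level API: DescribeStream -> GetShardIterator -> GetRecords."""
    records = []
    desc = streams_client.describe_stream(StreamArn=stream_arn)
    for shard in desc["StreamDescription"]["Shards"]:
        # TRIM_HORIZON starts from the oldest record still in the shard.
        it = streams_client.get_shard_iterator(
            StreamArn=stream_arn,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
        resp = streams_client.get_records(ShardIterator=it, Limit=limit)
        records.extend(resp["Records"])
    return records

# With real AWS credentials this would be driven by:
#   import boto3
#   client = boto3.client("dynamodbstreams")
#   batch = read_stream_batch(client, stream_arn)
```

This is exactly the bookkeeping (shard discovery, iterators, checkpointing) that the Kinesis Adapter plus the Kinesis Client Library handles for you, which is why AWS recommends that route.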
A few more things to consider:
Instead of Apache Spark, it is worth looking at Apache Flink. It is a stream-first solution (Spark implements streaming using micro-batching), has lower latencies, higher throughput, and more powerful streaming operators, and it supports cyclic data flows (iterations). It also has a Kinesis connector.
You may not need DynamoDB streams at all to get the data into Redshift: Redshift's COPY command can load data directly from a DynamoDB table.
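For example, a COPY from DynamoDB looks like this (the table names and IAM role ARN are placeholders; READRATIO caps the share of the table's provisioned read throughput the load may consume):

```sql
COPY ad_events
FROM 'dynamodb://AdEventsTable'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
READRATIO 50;
```

Note this does a full scan of the DynamoDB table rather than reading only new events, so it suits periodic bulk exports more than continuous aggregation.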