Tags: big-data, apache-flink, apache-beam, dataflow, amazon-kinesis-analytics

Recalculate historical data using Apache Beam


I have an Apache Beam streaming project that computes results and writes them to a database. After a bug fix, or after changing how the data is processed, what is the best way to reprocess all historical records without a long delay?


Solution

  • It is quite application-dependent.

    For example, here is a straightforward approach if you are using Kafka (and all of the historical data is still retained there):

    • Stop and relaunch the job (or, if you want no downtime at all, launch a second job while the first keeps running) without using a savepoint:
      • Use a different Kafka consumer group so it does not interfere with the existing pipeline
      • Write to a fresh database so its contents are rebuilt from scratch
      • Scale the job up so it finishes reprocessing as quickly as possible
    • Atomically switch the old database for the new one
    • Scale the job back down
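The steps above can be sketched in plain Python. This is an illustrative blue/green pattern, not Beam or Kafka API code: `reprocessing_job_config`, `DatabaseAlias`, and all config keys are hypothetical names standing in for whatever your deployment actually uses.

```python
def reprocessing_job_config(base):
    """Derive the catch-up job's config from the live job's config:
    fresh consumer group, fresh output table, higher parallelism.
    Keys here are illustrative, not real Beam/Kafka option names."""
    cfg = dict(base)
    cfg["kafka.group.id"] = base["kafka.group.id"] + "-reprocess"  # new consumer group
    cfg["output.table"] = base["output.table"] + "_v2"             # rebuild from scratch
    cfg["parallelism"] = base["parallelism"] * 4                   # scale up to catch up
    return cfg


class DatabaseAlias:
    """Readers resolve the alias rather than a hard-coded table name,
    so flipping the alias swaps old and new outputs in one step."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target

    def switch(self, new_target):
        # A single reassignment, so readers see either the old or the
        # new table, never a mix.
        self._target = new_target


# Live job's (hypothetical) config, and the derived catch-up config.
base = {"kafka.group.id": "events-consumer", "output.table": "results", "parallelism": 2}
catchup = reprocessing_job_config(base)

alias = DatabaseAlias(base["output.table"])
# ... run the catch-up job; once it has caught up with the live stream:
alias.switch(catchup["output.table"])  # the atomic switch step
# ... then scale the catch-up job back down and retire the old output.
```

The key design point is that consumers never reference the physical table directly; they go through the alias (in a real database this would be a view, synonym, or DNS/connection-string switch), which is what makes the final cut-over atomic.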