Tags: big-data, apache-flink, apache-beam, dataflow, amazon-kinesis-analytics

Recalculate historical data using Apache Beam


I have an Apache Beam streaming project that computes results and writes them to a database. After a bug fix, or after changing how the data is processed, what is the best way to reprocess all historical records without a long delay?


Solution

  • It is quite application-dependent.

    For example, here is a straightforward approach if you are using Kafka (and all of the historical data is still retained there):

    • Stop and relaunch the job (or, if you want no downtime at all, launch a second job while the first keeps running) without using a savepoint:
      • Use a different Kafka consumer group so it does not interfere with the existing pipeline
      • Write to a fresh database so its contents are rebuilt from scratch
      • Scale the job up so it finishes reprocessing as quickly as possible
    • Atomically switch the old database for the new one
    • Scale the job back down
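The steps above can be sketched in plain Python. This is an illustrative blue/green pattern, not Beam or Kafka API code: `reprocessing_job_config`, `DatabaseAlias`, and all config keys are hypothetical names standing in for whatever your deployment actually uses.

```python
def reprocessing_job_config(base):
    """Derive the catch-up job's config from the live job's config:
    fresh consumer group, fresh output table, higher parallelism.
    Keys here are illustrative, not real Beam/Kafka option names."""
    cfg = dict(base)
    cfg["kafka.group.id"] = base["kafka.group.id"] + "-reprocess"  # new consumer group
    cfg["output.table"] = base["output.table"] + "_v2"             # rebuild from scratch
    cfg["parallelism"] = base["parallelism"] * 4                   # scale up to catch up
    return cfg


class DatabaseAlias:
    """Readers resolve the alias rather than a hard-coded table name,
    so flipping the alias swaps old and new outputs in one step."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target

    def switch(self, new_target):
        # A single reassignment, so readers see either the old or the
        # new table, never a mix.
        self._target = new_target


# Live job's (hypothetical) config, and the derived catch-up config.
base = {"kafka.group.id": "events-consumer", "output.table": "results", "parallelism": 2}
catchup = reprocessing_job_config(base)

alias = DatabaseAlias(base["output.table"])
# ... run the catch-up job; once it has caught up with the live stream:
alias.switch(catchup["output.table"])  # the atomic switch step
# ... then scale the catch-up job back down and retire the old output.
```

The key design point is that consumers never reference the physical table directly; they go through the alias (in a real database this would be a view, synonym, or DNS/connection-string switch), which is what makes the final cut-over atomic.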