Tags: apache-kafka, stream, apache-flink, apache-kafka-streams

How to display intermediate results in a windowed streaming-etl?


We currently do real-time aggregation of data in an event store. The idea is to visualize transaction data for multiple time ranges (monthly, weekly, daily, hourly) and for multiple nominal keys. We regularly receive late data, so we need to account for that. Furthermore, the requirement is to display "running" results, that is, the value of the current window even before it is complete.

Currently we are using Kafka and Apache Storm (specifically Trident i.e. microbatches) to do this. Our architecture roughly looks like this:

[Architecture diagram: Kafka → Storm (Trident) → MongoDB → query microservice]

(Apologies for my ugly pictures.) We use MongoDB as a key-value store to persist the state, which is then made accessible (read-only) by a microservice that returns the current value it is queried for. There are multiple problems with that design:

  1. The code is really high maintenance.
  2. It is really hard to guarantee exactly-once processing in this manner.
  3. Updating the state after every aggregation obviously has performance implications, although it is sufficiently fast for now.

We got the impression that, since we started this project, better frameworks (especially from a maintenance standpoint — Storm tends to be really verbose) have become available, namely Apache Flink and Kafka Streams. Trying these out, it seemed that writing state to a database, especially MongoDB, is no longer state of the art. The standard use case I saw is state being persisted internally in RocksDB or memory and then written back to Kafka once a window is complete.

Unfortunately, this makes it quite difficult to display intermediate results, and because the state is persisted internally, we would need the allowed lateness of events to be on the order of months or years. Is there a better solution for these requirements than hijacking the state of the real-time stream? Personally, I feel this should be a standard requirement, but I couldn't find a standard solution for it.


Solution

  • You could study Konstantin Knauf's Queryable Billing Demo as an example of how to approach some of the issues involved. The central, relevant ideas used there are:

    1. Trigger the windows after every event, so that their results are continuously updated
    2. Make the results queryable (using Flink's queryable state API)
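To make the first idea concrete, here is a minimal, framework-free Python sketch of a per-element window trigger: instead of emitting a window's aggregate only when the window closes, the running aggregate is emitted after every event. The function name, event layout `(key, timestamp, value)`, and the sum aggregation are illustrative assumptions, not Flink API; in Flink this role is played by a custom `Trigger` that fires on each element.

```python
from collections import defaultdict

def running_window_sums(events, window_size):
    """Emit the running per-window sum after every event, mimicking a
    window trigger that fires on each element. Because per-window state
    is kept, a late event simply updates its (old) window's total."""
    sums = defaultdict(float)  # (key, window_start) -> running sum
    out = []
    for key, timestamp, value in events:
        # Assign the event to its tumbling window by flooring the timestamp.
        window_start = timestamp - (timestamp % window_size)
        sums[(key, window_start)] += value
        # Emit an updated result immediately, not just at window close.
        out.append((key, window_start, sums[(key, window_start)]))
    return out

events = [("acct-1", 5, 10.0), ("acct-1", 20, 5.0), ("acct-1", 65, 2.0)]
# With window_size=60 the first two events fall into window 0,
# the third into window 60; each event produces an updated result.
print(running_window_sums(events, 60))
# → [('acct-1', 0, 10.0), ('acct-1', 0, 15.0), ('acct-1', 60, 2.0)]
```

Each emitted tuple can be streamed straight to a dashboard or sink, which is exactly the "continuously updated results" behavior described above.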

    This was the subject of a Flink Forward conference talk. Video is available.

    Rather than making the results queryable, you could instead stream out the window updates to a dashboard or database.

    Also, note that you can cascade windows, meaning that the results of the hourly windows could be the input to the daily windows, etc.
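The cascading idea can be sketched in a few lines: the output of a fine-grained window (e.g. hourly sums) becomes the input of a coarser one (e.g. daily sums), so the raw events are aggregated only once. The function name and the `(key, window_start, value)` tuple layout are assumptions carried over from the example above, not a Flink API.

```python
from collections import defaultdict

def roll_up(window_results, factor):
    """Cascade window results: aggregate fine-grained window sums
    (e.g. hourly, window_start in seconds) into coarser windows
    (e.g. daily with factor=86400)."""
    coarse = defaultdict(float)
    for key, window_start, value in window_results:
        # Map each fine window to the coarse window containing it.
        coarse_start = window_start - (window_start % factor)
        coarse[(key, coarse_start)] += value
    return [(k, start, v) for (k, start), v in sorted(coarse.items())]

hourly = [("acct-1", 0, 3.0), ("acct-1", 3600, 4.0), ("acct-1", 86400, 1.0)]
# The two hourly sums from day 0 collapse into one daily sum;
# the third belongs to the next day.
print(roll_up(hourly, 86400))
# → [('acct-1', 0, 7.0), ('acct-1', 86400, 1.0)]
```

In Flink this corresponds to feeding the windowed stream's output into a second `window` operator with a larger size, which keeps the daily/monthly state small.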