apache-spark, spark-structured-streaming

Spark structured streaming real-time aggregation


Is it possible to output aggregation data on every trigger, before the aggregation time window is over?

Context: I'm developing an application that reads data from a Kafka topic, processes the data, aggregates it over a 1-hour window, and outputs to S3. However, the Spark application understandably writes the aggregated data to S3 only at the end of a given hour window.
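For reference, a minimal sketch of what such a pipeline typically looks like (the broker, topic, bucket and checkpoint paths below are placeholders, not taken from the actual application):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("hourly-aggregation").getOrCreate()

// Read the raw events from Kafka (placeholder broker and topic)
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Aggregate over a 1-hour tumbling window. With append output mode the sink
// only receives a window once the watermark passes its end, which is why the
// S3 output always lags by (at least) an hour.
val hourly = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "1 hour"))
  .count()

hourly.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/hourly-agg/")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/hourly-agg/")
  .outputMode("append")
  .start()
```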

The problem is that the end-users of the aggregated data in S3 only get a semi-real-time view: they are always up to one hour behind, waiting for the next aggregation to be output by the Spark application.

Reducing the aggregation time window to something smaller than an hour would certainly help, but would generate a lot more data.

What could be done to enable real-time aggregation, as I call it, using minimal resources?


Solution

  • This is an interesting one, and I do have a proposal, but I'm not sure it would really meet your minimal-resources criterion. I'll describe the solution anyway...

    If the end goal is to enable users to query data in real-time (in other words, faster analytics), then one way to achieve this is to introduce a database into your architecture that can handle fast inserts/updates: either a key-value store or a column-oriented database. Below is a diagram that might help you visualise this:

    [Architecture diagram: Kafka → fast store (key-value / column-oriented) → S3, with both tiers queried through Presto]

    The idea is simple: keep ingesting data into the first (fast) database, and keep offloading it to S3 after a certain time, i.e. an hour or a day, depending on your requirements. You could then register the metadata of both storage layers in a metadata layer such as AWS Glue (this may not always be necessary if you don't need a persistent metastore). On top of this, you could use something like Presto to query across both stores, which would also let you optimise how data is laid out across the two tiers. A sketch of the streaming write into the fast tier follows below.
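    If the streaming job itself feeds the fast tier, one way to do this in Structured Streaming is foreachBatch with update output mode, so partially aggregated windows are pushed to the low-latency store on every trigger. This is only a sketch under assumptions: the Kafka source mirrors the one in the question, and the JDBC URL, credentials, table and column names are placeholders for whatever key-value or columnar store you choose.

    ```scala
    import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("realtime-aggregation").getOrCreate()

    // Same Kafka source as in the question (placeholder broker/topic)
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // Running 1-hour aggregation, re-emitted on every trigger
    val runningHourly = events
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window(col("timestamp"), "1 hour"))
      .count()

    runningHourly.writeStream
      .outputMode("update") // emit only the windows that changed in this micro-batch
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Push the partial results into the fast store; a real implementation
        // would upsert on window_start instead of plain appends.
        batch
          .select(col("window.start").as("window_start"), col("count"))
          .write
          .format("jdbc")
          .option("url", "jdbc:postgresql://fast-store:5432/agg") // placeholder
          .option("dbtable", "hourly_counts_live")
          .option("user", "writer")
          .option("password", "secret")
          .mode(SaveMode.Append)
          .save()
      }
      .start()
    ```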

    You'll obviously need to build the process that drops/deletes the data partitions from the store you are streaming into and moves that data to S3, for example as sketched below.
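    As a rough sketch of that housekeeping step (again, all connection details, table and bucket names are placeholders), a scheduled batch job could copy closed hours to S3 and then delete them from the fast tier:

    ```scala
    import java.sql.DriverManager
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("offload-to-s3").getOrCreate()

    // Everything older than the start of the current hour is considered "closed"
    val cutoff = java.sql.Timestamp.valueOf(
      java.time.LocalDateTime.now().withMinute(0).withSecond(0).withNano(0))

    // 1. Copy closed windows from the fast store to the long-term tier on S3
    val closed = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://fast-store:5432/agg") // placeholder
      .option("user", "writer")
      .option("password", "secret")
      .option("dbtable",
        s"(SELECT * FROM hourly_counts_live WHERE window_start < '$cutoff') closed")
      .load()

    closed.write
      .mode("append")
      .partitionBy("window_start")
      .parquet("s3a://my-bucket/hourly-agg/")

    // 2. Drop the offloaded rows from the fast tier
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://fast-store:5432/agg", "writer", "secret")
    try {
      val stmt = conn.prepareStatement(
        "DELETE FROM hourly_counts_live WHERE window_start < ?")
      stmt.setTimestamp(1, cutoff)
      stmt.executeUpdate()
    } finally {
      conn.close()
    }
    ```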

    This model is referred to as a tiered (or hierarchical) storage model with a sliding-window pattern - see the reference article from Cloudera.

    Hope this helps!