Tags: apache-kafka, time-series, streaming, apache-flink

Platform for stateful frame-by-frame processing


Modern stream processing engines tend to focus on parallel processing of big data. This is great when the desired results are calculated from aggregates over the whole dataset or a filtered subset of it. In contrast, I need to process data sequentially, where a change from one row to the next has meaning and determines what has to be done next. All data is time-series data. As an example, imagine a video game generating results according to the current situation and the player's inputs. Spark kind of allows this with e.g. mapGroupsWithState(), but I believe that's not what it's designed for, as this case doesn't need data distributed across a cluster for processing (because the data has to be processed sequentially, it should benefit from holding the state in one place and passing all data through a single point of processing close to that state). I've looked at Flink, but I didn't find anything related to frame-by-frame stream processing.

Is there anything out there for solving this type of problem? I don't want to reinvent the wheel.

Thank you.


Solution

  • You can do event-at-a-time temporal pattern matching and time-series analysis with Flink.

    For an easier-to-use, higher-level API, look at the docs for doing pattern recognition with Flink SQL (the MATCH_RECOGNIZE clause). For a more powerful pattern recognition library, see the docs for Flink's CEP library.

    If you prefer to work directly with the lower-level building blocks of stateful, event-at-a-time stream processing, then the best place to start in Flink is with the KeyedProcessFunction.

    Using your example of analyzing a stream of events from a video game, if you want to separately (and in parallel) process streams from different players, you would do something like this:

    events
      .keyBy(event -> event.playerId)
      .process(new MyKeyedProcessFunction())
      ...
    

    but if you can't meaningfully key-partition the stream, then you can do this:

    events
      .keyBy(event -> "a constant")  // every event maps to the same key
      .process(new MyKeyedProcessFunction())
    

    The reason you probably want to use a KeyedStream even if you can't take advantage of parallelism is that Flink's keyed state and timers are easier to work with and more flexible than non-keyed state.

    See the process function docs for more information.
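
To make the event-at-a-time idea above concrete, here is a minimal plain-Java sketch of the kind of logic that would live inside a keyed process function. It deliberately does not use Flink: the `HashMap` stands in for Flink's keyed state (in a real `KeyedProcessFunction` you would probably keep the previous event in a `ValueState` instead), and `Event`, `DeltaProcessor`, and the "moved" output format are all hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrameByFrameSketch {

    /** Hypothetical input: one frame of data for one player. */
    static final class Event {
        final String playerId;
        final long timestamp;
        final int x;

        Event(String playerId, long timestamp, int x) {
            this.playerId = playerId;
            this.timestamp = timestamp;
            this.x = x;
        }
    }

    /** Plain-Java stand-in for a keyed process function (not a Flink class). */
    static final class DeltaProcessor {
        // Stand-in for keyed state: the last event seen for each key.
        private final Map<String, Event> previousByKey = new HashMap<>();

        /** Handles one event and emits the change since the previous event of the same key. */
        void processElement(Event event, List<String> out) {
            // put() stores the new event and returns the one it replaced, if any.
            Event previous = previousByKey.put(event.playerId, event);
            if (previous != null) {
                out.add(event.playerId + " moved " + (event.x - previous.x));
            }
        }
    }

    public static void main(String[] args) {
        DeltaProcessor processor = new DeltaProcessor();
        List<String> out = new ArrayList<>();

        // Events arrive one at a time, interleaved across players.
        processor.processElement(new Event("alice", 1L, 0), out);
        processor.processElement(new Event("bob", 1L, 10), out);
        processor.processElement(new Event("alice", 2L, 3), out);
        processor.processElement(new Event("bob", 2L, 7), out);

        for (String line : out) {
            System.out.println(line);
        }
    }
}
```

Each call sees exactly one event plus whatever state was left behind by earlier events with the same key, which is the "a change from one row to the next has meaning" behaviour the question asks for. With a constant key, the map would only ever hold a single entry.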