Tags: apache-flink, flink-streaming, system-design

Complex Flink streaming topology


My use case is a bit unusual, and I'd like some help figuring out whether (and how) I can do this with Flink. I have a stream of data coming in from a Kafka topic, and each message contains a field (e.g. user ID, event ID) that I need to use to query other services via RPC, so I can join the returned values with the message and write the result to a sink.

Now, the Kafka source is easy enough, as Flink already provides connectors to consume from Kafka; my problem lies with the external sources. I've thought about implementing a custom source for them, but without the field from the Kafka message I wouldn't know what to query, so that doesn't quite make sense.

In addition to that, the Kafka messages are likely to contain lots of duplicate data, so I also want to cache some results so that I'm not making duplicate calls.

Finally, some form of rate limiting is also desirable for the other sources, as their respective services may not be able to handle incredibly high traffic.

What would a topology design look like in this case? I've thought of using a KeyedProcessFunction to process the messages, call the other services for their fields, and store the values in value state; however, I'm not sure how I would achieve rate limiting with this approach.
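
Roughly the shape I had in mind, as a sketch (the messages are simplified to plain id Strings, and callExternalService is just a stand-in for the real RPC client):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Keyed by the lookup field; caches the fetched value in keyed state. */
public class EnrichFunction extends KeyedProcessFunction<String, String, Tuple2<String, String>> {

    private transient ValueState<String> cachedValue;

    @Override
    public void open(Configuration parameters) {
        cachedValue = getRuntimeContext().getState(
                new ValueStateDescriptor<>("cached-value", String.class));
    }

    @Override
    public void processElement(String id, Context ctx, Collector<Tuple2<String, String>> out)
            throws Exception {
        String value = cachedValue.value();
        if (value == null) {
            value = callExternalService(id);   // blocking RPC -- unclear how to rate-limit this
            cachedValue.update(value);
        }
        out.collect(Tuple2.of(id, value));
    }

    // Placeholder for the real RPC call.
    private String callExternalService(String id) {
        return "value-for-" + id;
    }
}
```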


Solution

  • Typically you'd use Flink's Async I/O support to query the external system in a way that's multi-threaded and non-blocking (see the first sketch after this list).

    The caching of results is a bit trickier, as you can't (easily) do iterations in Flink, which is what would be needed to feed the results of the async query back upstream so that already-resolved records can be split off before making an async call.

    You can emulate a loop by writing the result of the async call to a Kafka topic, and using that same topic as an input that you join with the incoming data stream, so you can detect "recent enough" results and skip the call for those records (see the second sketch below).
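
For the Async I/O part, here is a minimal sketch of what the operator could look like, assuming the lookup key is a plain String; UserLookupAsyncFunction and HypotheticalUserClient are made-up names, and lookup() stands in for whatever async call your real RPC client exposes:

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

/** Enriches each incoming id with a value fetched asynchronously over RPC. */
public class UserLookupAsyncFunction extends RichAsyncFunction<String, Tuple2<String, String>> {

    private transient HypotheticalUserClient client;

    @Override
    public void open(Configuration parameters) {
        client = new HypotheticalUserClient();   // create the (non-serializable) client per task
    }

    @Override
    public void asyncInvoke(String id, ResultFuture<Tuple2<String, String>> resultFuture) {
        // The client returns a CompletableFuture, so the operator thread is never blocked.
        client.lookup(id).thenAccept(value ->
                resultFuture.complete(Collections.singleton(Tuple2.of(id, value))));
    }

    @Override
    public void timeout(String id, ResultFuture<Tuple2<String, String>> resultFuture) {
        // Emit the record un-enriched on timeout rather than failing the job.
        resultFuture.complete(Collections.singleton(Tuple2.of(id, null)));
    }

    /** Wiring: 5 s timeout, at most 50 in-flight lookups per parallel subtask (the capacity). */
    public static DataStream<Tuple2<String, String>> enrich(DataStream<String> idStream) {
        return AsyncDataStream.unorderedWait(
                idStream, new UserLookupAsyncFunction(), 5, TimeUnit.SECONDS, 50);
    }

    /** Stand-in for the real async RPC client. */
    static class HypotheticalUserClient {
        CompletableFuture<String> lookup(String id) {
            return CompletableFuture.supplyAsync(() -> "value-for-" + id);
        }
    }
}
```

The capacity argument of unorderedWait caps the number of concurrent in-flight requests, which gives you a crude form of rate limiting; a hard requests-per-second limit would have to live inside the client itself.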
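
One way to wire up that "join with recent results" step, as a sketch: key both streams by the lookup field, connect them, and keep the looped-back results in keyed state. Everything here (CacheCheckFunction, the tuple layouts, the ten-minute freshness window, the needs-lookup side output) is invented for illustration:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

/**
 * Stream 1: raw ids from the Kafka source. Stream 2: enrichment results looped back
 * from the results topic as (id, value, fetchTimeMillis). Ids with a fresh-enough
 * cached result are emitted directly; the rest go to a side output that feeds the
 * Async I/O lookup.
 */
public class CacheCheckFunction extends
        KeyedCoProcessFunction<String, String, Tuple3<String, String, Long>, Tuple2<String, String>> {

    public static final OutputTag<String> NEEDS_LOOKUP = new OutputTag<String>("needs-lookup") {};
    private static final long MAX_AGE_MS = 10 * 60 * 1000;   // "recent enough" = ten minutes

    private transient ValueState<Tuple2<String, Long>> cached;   // (value, fetchTimeMillis)

    @Override
    public void open(Configuration parameters) {
        cached = getRuntimeContext().getState(new ValueStateDescriptor<>(
                "cached-result", TypeInformation.of(new TypeHint<Tuple2<String, Long>>() {})));
    }

    @Override
    public void processElement1(String id, Context ctx, Collector<Tuple2<String, String>> out)
            throws Exception {
        Tuple2<String, Long> hit = cached.value();
        if (hit != null && System.currentTimeMillis() - hit.f1 < MAX_AGE_MS) {
            out.collect(Tuple2.of(id, hit.f0));      // cached and recent enough: reuse it
        } else {
            ctx.output(NEEDS_LOOKUP, id);            // missing or stale: route to the async lookup
        }
    }

    @Override
    public void processElement2(Tuple3<String, String, Long> result, Context ctx,
                                Collector<Tuple2<String, String>> out) throws Exception {
        cached.update(Tuple2.of(result.f1, result.f2));   // refresh the per-key cache
    }
}
```

The wiring would be roughly rawIds.connect(resultsFromKafka).keyBy(id -> id, r -> r.f0).process(new CacheCheckFunction()): the main output goes straight to the sink, while getSideOutput(CacheCheckFunction.NEEDS_LOOKUP) feeds the async operator above, whose results go both to the sink and back to the results topic to close the loop.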