Tags: apache-kafka, distributed-system, stream-processing

Can Kafka Streams join streams efficiently?


I'm new to Kafka and I'd like to know if what I'm planning is possible and reasonable to implement.

Suppose we have two sources, s1 and s2, that emit messages to topics t1 and t2 respectively. Now, I'd like a sink that listens to both topics and processes pairs of messages <m1, m2> where m1.key == m2.key.

If m1.key never appears in any message from s2, then the sink ignores m1 entirely (it will never process it).

In summary, the sink works only on keys that both s1 and s2 have emitted.

A traditional, and maybe naive, solution would be to keep some sort of cache or storage and to process an item only once both messages are in the cache, as sketched below.
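For concreteness, here is a minimal sketch of that cache-based idea (the Message type and handler names are hypothetical; it assumes at most one message per key per side and an unbounded in-memory map):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical message type: just a key and a payload.
record Message(String key, String value) {}

public class NaiveJoinCache {
    private final Map<String, Message> pendingFromS1 = new HashMap<>();
    private final Map<String, Message> pendingFromS2 = new HashMap<>();

    // Called for every message consumed from t1.
    public void onMessageFromT1(Message m1) {
        Message m2 = pendingFromS2.remove(m1.key());
        if (m2 != null) {
            process(m1, m2);                  // both sides present: emit the pair
        } else {
            pendingFromS1.put(m1.key(), m1);  // buffer until the other side arrives
        }
    }

    // Called for every message consumed from t2 (mirror image of the above).
    public void onMessageFromT2(Message m2) {
        Message m1 = pendingFromS1.remove(m2.key());
        if (m1 != null) {
            process(m1, m2);
        } else {
            pendingFromS2.put(m2.key(), m2);
        }
    }

    private void process(Message m1, Message m2) {
        System.out.println("Joined pair: <" + m1 + ", " + m2 + ">");
    }
}
```

The obvious downsides are that the maps grow without bound for keys that only ever appear on one side, and that the cache is lost on restart; this is exactly the kind of state management a streaming framework handles for you.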

I'd like to know if Kafka offers a solution to this problem.


Solution

  • Most modern stream processing engines, such as Apache Flink, Kafka Streams, or Spark Streaming, can solve this problem for you. All three have battle-tested Kafka consumers built for use cases like this.

    Even within those frameworks, there are multiple ways to achieve a streaming join like the one above. In Flink, for example, one could use the Table API, which has a SQL-like syntax (a rough sketch is included at the end of this answer).

    What I have used in the past looks a bit like the example in this SO answer (you can just replace fromElements with a Kafka Source).

    One thing to keep in mind when working with streams is that you do NOT have any ordering guarantees when consuming data from two Kafka topics t1 and t2. Your code needs to account for messages arriving in any order.

    Edit: I just realised your question was probably about how to implement the join with Kafka Streams itself, as opposed to a generic stream of data consumed from Kafka. In that case you will probably find relevant info here; a minimal Kafka Streams join is also sketched below.
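For the Table API route mentioned above, a rough sketch could look like the following (schemas, topic names, and connector options are illustrative, and it assumes Flink's SQL Kafka connector is on the classpath):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkJoinExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Declare the two Kafka topics as tables (schema and options are
        // placeholders; adjust to your actual message format).
        tEnv.executeSql(
                "CREATE TABLE t1 (k STRING, v STRING) WITH (" +
                " 'connector' = 'kafka'," +
                " 'topic' = 't1'," +
                " 'properties.bootstrap.servers' = 'localhost:9092'," +
                " 'format' = 'json'," +
                " 'scan.startup.mode' = 'earliest-offset')");
        tEnv.executeSql(
                "CREATE TABLE t2 (k STRING, v STRING) WITH (" +
                " 'connector' = 'kafka'," +
                " 'topic' = 't2'," +
                " 'properties.bootstrap.servers' = 'localhost:9092'," +
                " 'format' = 'json'," +
                " 'scan.startup.mode' = 'earliest-offset')");

        // A regular inner join: rows are emitted only for keys that appear
        // on both sides, matching the question's requirement. Note that Flink
        // keeps state for both inputs indefinitely unless a state TTL or an
        // interval join is configured.
        Table joined = tEnv.sqlQuery(
                "SELECT t1.k, t1.v AS v1, t2.v AS v2 " +
                "FROM t1 JOIN t2 ON t1.k = t2.k");
        joined.execute().print();
    }
}
```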
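And for the Kafka Streams route, a minimal windowed inner join might look like this (topic names and serdes are assumptions, and the API shown is the Kafka 3.x one; older versions use JoinWindows.of instead):

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.StreamJoined;

public class KafkaStreamsJoinExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> s1 = builder.stream("t1");
        KStream<String, String> s2 = builder.stream("t2");

        // Inner join: a pair is emitted only when both topics have produced
        // a message with the same key within the join window, which matches
        // the "ignore keys seen on only one side" requirement.
        KStream<String, String> joined = s1.join(
                s2,
                (v1, v2) -> v1 + "," + v2,  // combine the two payloads
                JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)),
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        joined.to("joined-output");         // hypothetical sink topic

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Because this join is windowed, only messages whose timestamps fall within the window are paired; if matching keys can arrive arbitrarily far apart in time, a KTable-KTable join or a much larger window would be closer to the semantics in the question.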