Let's say I have 2 Kafka topics, each with one partition.
I want to use a technology such as Kafka Streams, Apache Flink, or Spring Kafka to write data into a third topic by joining these two topics in chronological order of their Kafka timestamps. However, I'm concerned about an issue that might arise during the first run of such a program: for example, topic A might hold 20 million messages going back 3 weeks, while topic B holds only 1,000 messages from the past day.
Any suggestions?
With Flink you can use watermark alignment to throttle whichever source runs ahead in event time, which avoids having to buffer large amounts of data while the source with the bigger backlog catches up.
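Here is a minimal sketch of what that could look like (watermark alignment was introduced in Flink 1.15 via `withWatermarkAlignment`). The topic names, broker address, group id, and drift/update intervals below are assumptions to adapt to your setup; the Kafka source uses the record's Kafka timestamp as event time by default, which matches your ordering requirement:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AlignedSourcesJob {

    // Hypothetical helper; broker address and group id are assumptions.
    private static KafkaSource<String> kafkaSource(String topic) {
        return KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics(topic)
                .setGroupId("aligned-join")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Both sources share the alignment group "join-alignment": Flink pauses
        // whichever source's watermark drifts more than 1 minute ahead of the
        // group minimum, re-checking every second. The fresh topic B therefore
        // cannot race weeks ahead of the 3-week backlog in topic A.
        WatermarkStrategy<String> aligned = WatermarkStrategy
                .<String>forBoundedOutOfOrderness(Duration.ofSeconds(20))
                .withWatermarkAlignment("join-alignment", Duration.ofMinutes(1), Duration.ofSeconds(1));

        DataStream<String> a = env.fromSource(kafkaSource("topic-A"), aligned, "topic-A");
        DataStream<String> b = env.fromSource(kafkaSource("topic-B"), aligned, "topic-B");

        // ... join a and b here and sink the result to the third topic ...

        env.execute("aligned-sources");
    }
}
```

As I recall, your setup of one partition per topic also sidesteps an early limitation: with a single Kafka partition each source has a single split, which is the case alignment supported from the start; pausing individual splits of multi-partition sources only arrived in later Flink releases.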
Kafka Streams will merge the two streams by always processing next from whichever stream has the record with the lower timestamp, but only on a best-effort basis.
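How best-effort it is can be tuned with the `max.task.idle.ms` config, which tells Streams how long a task should wait for data on an empty input partition before processing what it already has. A minimal sketch, assuming String-keyed/valued topics named `topic-A`, `topic-B`, and `topic-C` and a broker at `localhost:9092` (all placeholder names):

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class TimestampOrderedMerge {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "timestamp-ordered-merge");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Wait up to 30 s for data to arrive on an empty input partition
        // before processing, instead of racing ahead on whichever partition
        // already has data; the default only orders records that have
        // already been fetched, which is why ordering is best-effort.
        props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 30_000L);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> a = builder.stream("topic-A");
        KStream<String, String> b = builder.stream("topic-B");
        // merge() interleaves the two streams; the task picks the record with
        // the lowest timestamp whenever both partitions have buffered data.
        a.merge(b).to("topic-C");

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Even with a large `max.task.idle.ms`, this remains a heuristic rather than a guarantee, but for your first-run scenario it should keep the 1-day-old topic B from being consumed long before the 3-week backlog in topic A.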