Tags: apache-kafka, apache-flink, apache-kafka-streams

Can Kafka Streams prevent data duplication when using filtering logic?


I have a Flink streaming job structured as below.

source -> filter -> sink

I consume messages from Kafka, filter out some records, and send the remaining records to another Kafka topic. Note that the source and destination topics are on the same Kafka brokers.

In this setup, the destination topic is derived from the source topic. For example, the messages stored in the source topic look like this:

source topic A : a,1,b,2,c,3,d,4,e,5...

After processing, the destination topic contains the values below, with the digit records filtered out:

dest topic B : a,b,c,d,e....

So my broker ends up with two different topics that both contain the alphabetic values, i.e. those values are duplicated. I implemented the logic above using Flink and it works fine, but I don't want to duplicate values across topics as shown in the example above. For reference, the job looks roughly like the sketch below.
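This is a minimal sketch, assuming String-valued records and Flink's KafkaSource/KafkaSink connectors; the broker address and consumer group id are placeholders, and A and B are the topics from the example:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FilterJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("localhost:9092")     // placeholder broker address
                    .setTopics("A")
                    .setGroupId("filter-job")                  // placeholder group id
                    .setStartingOffsets(OffsetsInitializer.earliest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            KafkaSink<String> sink = KafkaSink.<String>builder()
                    .setBootstrapServers("localhost:9092")
                    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                            .setTopic("B")
                            .setValueSerializationSchema(new SimpleStringSchema())
                            .build())
                    .build();

            env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
               // drop records that are purely digits, keep the alphabetic ones
               .filter(value -> !value.chars().allMatch(Character::isDigit))
               .sinkTo(sink);

            env.execute("source-filter-sink");
        }
    }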

So I am wondering whether Kafka Streams has the ability to reference values in the source topic instead of copying the filtered values to another topic.

If so, I would likely switch from Flink to Kafka Streams, because it would save disk space when a very large number of records is processed in production.

Since I don't have deep knowledge of Kafka Streams, I would appreciate any insight into this issue.

Thanks.


Solution

I don't understand where the duplication exists in the example provided: both topics shown already contain unique values. You cannot move or remove records from one Kafka topic to another, only copy them. So, in order to prevent duplicates in stream processing, you need to track what has already been seen.

Both Flink and Kafka Streams allow you to persist data in RocksDB-backed state stores that can be used to look up key-value pairs while consuming new events. So, by storing every record you have seen as a key, you naturally build up a set of unique values that you can compare against the current record being read, as in the sketch below.
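For illustration, here is a minimal sketch of that pattern in Kafka Streams, assuming String keys and values. Topics A and B come from the question; the store name seen-values, the digit filter, and the application id are assumptions, and note that transformValues has been superseded by processValues in newer Kafka Streams releases:

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
    import org.apache.kafka.streams.processor.ProcessorContext;
    import org.apache.kafka.streams.state.KeyValueStore;
    import org.apache.kafka.streams.state.Stores;

    public class DedupFilterApp {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Persistent (RocksDB-backed) store remembering every value already emitted.
            builder.addStateStore(Stores.keyValueStoreBuilder(
                    Stores.persistentKeyValueStore("seen-values"),
                    Serdes.String(), Serdes.String()));

            KStream<String, String> input =
                    builder.stream("A", Consumed.with(Serdes.String(), Serdes.String()));

            input
                // Drop the digit records, as in the original Flink job.
                .filter((key, value) -> !value.chars().allMatch(Character::isDigit))
                // Emit a value only the first time it is seen; null marks a duplicate.
                .transformValues(() -> new ValueTransformerWithKey<String, String, String>() {
                    private KeyValueStore<String, String> seen;

                    @Override
                    @SuppressWarnings("unchecked")
                    public void init(ProcessorContext context) {
                        seen = (KeyValueStore<String, String>) context.getStateStore("seen-values");
                    }

                    @Override
                    public String transform(String key, String value) {
                        if (seen.get(value) != null) {
                            return null;        // already seen: suppress
                        }
                        seen.put(value, "");    // remember this value
                        return value;
                    }

                    @Override
                    public void close() {}
                }, "seen-values")
                .filter((key, value) -> value != null)
                .to("B", Produced.with(Serdes.String(), Serdes.String()));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-filter");      // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            new KafkaStreams(builder.build(), props).start();
        }
    }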

Disk usage will be approximately the same in either case. Kafka Streams, however, does not require an external scheduler.

You could also instead dump the data into Redis and compare against it there, as one example; see the sketch below.
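As a minimal sketch of that idea using the Jedis client (the host, port, and the set name seen-values are placeholders), SADD returns 1 only when the member is new, which makes the duplicate check a single call:

    import redis.clients.jedis.Jedis;

    public class RedisDedup {
        public static void main(String[] args) {
            // Placeholder host/port; "seen-values" is an assumed Redis set name.
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                String[] records = {"a", "1", "b", "a", "2", "b", "c"};
                for (String value : records) {
                    if (value.chars().allMatch(Character::isDigit)) {
                        continue;                          // filter out digit records
                    }
                    // SADD returns 1 only if the member was not already in the set,
                    // so duplicates are detected in a single round trip.
                    if (jedis.sadd("seen-values", value) == 1L) {
                        System.out.println("forward to topic B: " + value);
                    }
                }
            }
        }
    }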