Tags: apache-kafka, kafka-consumer-api, file-writing

Avoid Data Loss While Processing Messages from Kafka


I'm looking for the best approach to designing my Kafka consumer. Basically, I want to know the best way to avoid data loss when there are exceptions or errors while processing the messages.

My use case is as below.

[Diagram: the use-case flow, with numbered steps (image not available)]

a) The reason I am using a SERVICE to process the messages is that, in the future, I am planning to write an ERROR PROCESSOR application which would run at the end of the day and try to reprocess the failed messages (not all messages, only the ones that failed because of a dependency, such as a missing parent).

b) I want to make sure there is zero message loss, so I will save the message to a file in case there are any issues while saving it to the DB.

c) In the production environment there can be multiple instances of the consumer and service running, so there is a high chance that multiple applications will try to write to the same file.

Q-1) Is writing to a file the only option to avoid data loss?

Q-2) If it is the only option, how do I make sure multiple applications can write to and read from the same file at the same time? Please consider that in the future, once the error processor is built, it might be reading messages from the same file while another application is trying to write to it.

ERROR PROCESSOR - Our source follows an event-driven mechanism, and there is a high chance that the dependent event (for example, the parent entity of something) sometimes gets delayed by a couple of days. In that case, I want my ERROR PROCESSOR to be able to process the same messages multiple times.


Solution

  • I've run into something similar before. So, diving straight into your questions:

    • Not necessarily. You could instead send those messages back to Kafka on a new topic (let's say - error-topic). Then, when your error processor is ready, it can simply listen to this error-topic and consume those messages as they come in.

    • I think this question is addressed by the answer to the first one. So, instead of writing to and reading from a file and juggling multiple open file handles concurrently, Kafka is likely the better choice, as it is designed for exactly this kind of problem. A sketch of the pattern follows this list.
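
    To make that concrete, here is a minimal sketch of the pattern using the plain Java clients. Everything specific in it - the broker address, the topic names (main-topic, error-topic), the group id, and the process() call - is a placeholder for your own setup. The key idea is that auto-commit is disabled, so offsets are committed only after every record in the batch has been either processed or parked on the error topic.

    ```java
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ResilientConsumer {

        public static void main(String[] args) throws Exception {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "main-processor");
            consumerProps.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            // Commit manually, only after a record is handled one way or the other.
            consumerProps.put("enable.auto.commit", "false");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            // Wait for all in-sync replicas so the parked message is durable too.
            producerProps.put("acks", "all");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

                consumer.subscribe(List.of("main-topic"));

                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                        try {
                            process(record.value()); // your SERVICE / DB-save logic
                        } catch (Exception e) {
                            // Park the failed message instead of dropping it.
                            // get() blocks until the broker acknowledges the write,
                            // so the message is safe before we move on.
                            producer.send(new ProducerRecord<>("error-topic",
                                    record.key(), record.value())).get();
                        }
                    }
                    // Every record in this batch was processed or parked, so it
                    // is now safe to commit the offsets.
                    consumer.commitSync();
                }
            }
        }

        private static void process(String message) {
            // Placeholder for the actual processing.
        }
    }
    ```

    Blocking on send(...).get() trades a little throughput for the guarantee that a failed message has actually reached the error-topic before its offset is committed, which is what gives you the zero-loss property.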

    Note: The following point is just some food for thought based on my limited understanding of your problem domain, so feel free to ignore it.

    One more point worth considering in the design of your service component: you might as well merge points 4 and 5 by sending all the error messages back to Kafka. That will let you handle every error message in a consistent way, as opposed to putting some messages in the error DB and some in Kafka.

    EDIT: Based on the additional information on the ERROR PROCESSOR requirement, here's a diagrammatic representation of the solution design.

    [Diagram: solution design (image not available)]

    I've deliberately kept the output of the ERROR PROCESSOR abstract for now, just to keep it generic. A rough sketch of what such a processor could look like is below.
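
    Here is one hedged sketch of a batch-style ERROR PROCESSOR, again with the plain Java clients. The topic name, group id, and the dependenciesSatisfied()/reprocess() helpers are all hypothetical stand-ins for your own logic; the re-queue branch is what lets a message with a still-missing parent be retried on a later run.

    ```java
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ErrorProcessor {

        public static void main(String[] args) throws Exception {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            // Separate group, so its offsets are independent of the main consumer.
            consumerProps.put("group.id", "error-processor");
            consumerProps.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("enable.auto.commit", "false");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("acks", "all");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

                consumer.subscribe(List.of("error-topic"));

                // A bounded number of polls suits a batch-style, end-of-day run.
                // A production job would instead capture the end offsets at
                // startup and read exactly up to them, so it never chases the
                // messages it re-queues itself within the same run.
                for (int i = 0; i < 5; i++) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(2));
                    for (ConsumerRecord<String, String> record : records) {
                        if (dependenciesSatisfied(record.value())) {
                            reprocess(record.value()); // e.g. save to the DB now that the parent exists
                        } else {
                            // Parent still missing: re-queue so a later run
                            // sees the message again - nothing is dropped.
                            producer.send(new ProducerRecord<>("error-topic",
                                    record.key(), record.value())).get();
                        }
                    }
                    consumer.commitSync();
                }
            }
        }

        private static boolean dependenciesSatisfied(String message) {
            // Placeholder: check whether the dependent (parent) entity has arrived.
            return false;
        }

        private static void reprocess(String message) {
            // Placeholder for the actual retry logic.
        }
    }
    ```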

    I hope this helps!