apache-kafka-streams

Data gets discarded while using the Kafka Streams aggregate method


I recently ran into a thorny problem while using the Kafka Streams `TimeWindowedKStream` aggregation method. I stopped my program for 5 minutes and then restarted it, and found that a small part of my data was lost, with the following log message: "Skipping record for expired window". All of the data are normal records that should be saved; there is no large delay. What can I do to prevent data from being discarded? It seems that Kafka Streams picked up a later stream time (`observedStreamTime`) after the restart.


Solution

  • The error message means that the window was already closed, so you would need to increase the grace period, as pointed out by @groo. Data expiration is based on event-time, so stopping your program and resuming it later should not change much by itself.
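
    One way to increase the grace period is via `TimeWindows.ofSizeAndGrace` (available since Kafka Streams 3.0; older versions use `TimeWindows.of(...).grace(...)`). A minimal sketch, assuming a 5-minute tumbling window; the topic name, serdes, and aggregation logic are illustrative, not taken from the question:

    ```java
    import java.time.Duration;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.TimeWindows;

    public class GraceExample {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            builder.<String, Long>stream("input-topic")   // illustrative topic name
                   .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
                   // 5-minute tumbling windows; a 10-minute grace keeps windows
                   // open longer, so out-of-order records observed during
                   // catch-up are still accepted instead of being skipped.
                   .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5),
                                                          Duration.ofMinutes(10)))
                   .aggregate(() -> 0L,
                              (key, value, agg) -> agg + value,
                              Materialized.with(Serdes.String(), Serdes.Long()));
        }
    }
    ```

    Note the trade-off: a larger grace delays when a window is finally closed, so downstream consumers see the final result for a window correspondingly later.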

    However, if there is a repartition topic before the aggregation and you stop your program for some time, there might be more out-of-order data inside the repartition topic, because the input topic is read much faster during catch-up than in a "live run". This increased out-of-order-ness during catch-up could be the issue.
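
    To see why a larger grace helps, here is a self-contained sketch of the expiry rule: a record is skipped once observed stream-time passes its window's end plus the grace period. The `isExpired` helper and all numbers are illustrative, not the actual Kafka Streams internals:

    ```java
    import java.time.Duration;

    public class GraceCheck {
        // Simplified version of the check behind "Skipping record for expired window":
        // the tumbling window containing the record is closed once observed
        // stream-time reaches windowEnd + grace. All times are epoch millis.
        static boolean isExpired(long recordTs, long windowSize, long grace,
                                 long observedStreamTime) {
            long windowStart = recordTs - (recordTs % windowSize);
            long windowEnd = windowStart + windowSize;
            return observedStreamTime >= windowEnd + grace;
        }

        public static void main(String[] args) {
            long size = Duration.ofMinutes(5).toMillis();
            // Stream-time has raced ahead to minute 20 during catch-up,
            // and a record for minute 12 (window [10, 15)) arrives late.
            long observed = Duration.ofMinutes(20).toMillis();
            long lateRecord = Duration.ofMinutes(12).toMillis();

            // 30-second grace: window closed at 15:30, record is dropped.
            System.out.println(isExpired(lateRecord, size,
                    Duration.ofSeconds(30).toMillis(), observed)); // true
            // 10-minute grace: window stays open until 25:00, record is kept.
            System.out.println(isExpired(lateRecord, size,
                    Duration.ofMinutes(10).toMillis(), observed)); // false
        }
    }
    ```

    During a live run the same record would likely arrive only seconds late and fit inside even a small grace; it is the compressed catch-up timeline that makes the gap between record timestamps and observed stream-time large enough to cross the grace boundary.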