Tags: scala, apache-spark, spark-structured-streaming, azure-eventhub

Skipping of batches in spark structured streaming process


I have a Spark Structured Streaming job that consumes events from the Azure Event Hubs service. In some cases, certain batches are not processed by the streaming job. When this happens, the following statement appears in the structured streaming log:

INFO FileStreamSink: Skipping already committed batch 25

The streaming job persists the incoming events into an Azure Data Lake, so I can check which events have actually been processed/persisted. When the skipping described above happens, these events are missing!

It is unclear to me why these batches are marked as already committed, because in the end it seems they were never actually processed.

Do you have an idea what might cause this behaviour?

Thanks!


Solution

  • I was able to solve the issue. The problem was that I had two different streaming jobs which had different checkpoint locations (which is correct) but used the same base folder for their output. The output folder also stores metadata (the sink's commit log), so the two streams effectively shared the record of which batches had already been committed: each job skipped batches that the other one had committed. After switching to a different base output folder per job, the issue was fixed.
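
To illustrate the fix, here is a minimal sketch of the write side of such a job. The paths and app name are hypothetical, and a `rate` source stands in for the Event Hubs source; the point is that each job gets its own `checkpointLocation` *and* its own output `path`, because `FileStreamSink` keeps its commit log in a `_spark_metadata` directory under the output path:

```scala
import org.apache.spark.sql.SparkSession

object PersistEventsJobA {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("persist-events-job-a") // hypothetical name
      .getOrCreate()

    // Placeholder source for illustration; the original job used
    // the Azure Event Hubs connector instead.
    val events = spark.readStream
      .format("rate")
      .load()

    // Both locations must be unique per streaming job. FileStreamSink
    // writes a _spark_metadata commit log under the output path, so two
    // jobs sharing one output base folder also share that log and will
    // skip batches the other job has already committed.
    events.writeStream
      .format("parquet")
      .option("checkpointLocation", "abfss://container@account.dfs.core.windows.net/checkpoints/job-a")
      .option("path", "abfss://container@account.dfs.core.windows.net/output/job-a")
      .start()
      .awaitTermination()
  }
}
```

A second job would use `.../checkpoints/job-b` and `.../output/job-b`; inspecting `output/job-a/_spark_metadata` shows which batch IDs the sink considers committed.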