Search code examples
performancespark-streamingspark-structured-streamingazure-eventhub

Spark Structured Streaming - WAL performance degradation


We have a spark structured streaming query that reads data from eventhub, does some processing and write data back to eventhub. We have checkpointing enabled - we store the checkpoint data in the Azure Datalake Gen2.

When we run the query, we see something weird - over time, our query's performance (latency) slowly degrades. When we run the query for the first time, the batch duration time is ~3 secs. After a day of run, the batch duration time is 20 secs and after 2 days, we get to a 40 secs+.. Interestingly, when we delete the checkpoint folder (or otherwisely reset the checkpoint), the latency goes back to normal (2 secs).

Looking at the query performance after 2 days of running on the same checkpoint directory, it is quite clear that it is the write-ahead-log / "walCommit", which grows and after some time accounts for the majority of the processing time.

enter image description here

My questions are: what drives this behaviour - is it natural for walCommit to take longer and longer? Could it be Azure Datalake Gen2 specific? Do we even need write-ahead-logs for eventhub? What are general ways how to improve this (not assuming disabling the WAL)..


Solution

  • Thanks @tomas-bartalos for the answer!

    We found another issue, that was the real cause of our problem - properties of Azure Gen2 Storage (with hierarchical namespace enabled). It seems Azure Gen2 is slow when listing a lot of files. We tried to open the streaming checkpoint directory using the Azure Explorer and it took about 20 seconds (similar to the walCommit time). We switched to the Azure Blob Storage and the problem was gone. We haven't done anything with the crc files (tomas's answer) so we concluded that te storage mode was the main issue.