Search code examples
apache-sparkscalatestspark-structured-streaming

How to test streaming window aggregations?


I read some data off Kafka and perform window aggregations over it in the following manner:

inputDataSet
      .groupBy(functions.window(inputDataSet.col("timestamp"), "1 minutes"),
        col("id")))
      .agg(count("secondId").as("myCount")) 

How do I unit test this piece of code? All examples over the net are around Dstreams, but I am using a data set by loading from Kafka in this manner:

sparkSession.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", Settings.KAFKA_URL)
      .option("subscribe", Settings.KAFKA_TOPIC1)
      .option("failOnDataLoss", false)
      .option("startingOffsets", "latest").load()
      .select(from_json(col("value").cast("String"), mySchema).as("value"))

and passing this dataset to my aggregation function


Solution

  • I haven't heard of a testing framework that would help you with the testing of a streaming Datasets out of the box (beside the test suite in Spark itself that it uses internally).

    If I were to test a streaming Dataset I'd not use the Kafka source since it does not add much to the testing yet introduces another layer where things may get wrong.

    I'd use memory source and sink. I'd also review the tests for Structured Streaming in the Spark repo, esp. StreamingAggregationSuite since you asked about windowed streaming aggregates. Use the tests as a set of guidelines that you could employ for your own tests.