azure-data-explorer, adx, apache-spark-connector

Spark-Kusto connector writestream writes only 1 batch and then goes idle


We are trying to achieve low-latency structured streaming ingestion into ADX (Azure Data Explorer) from Databricks, using a PySpark writeStream with the open source Spark-Kusto connector.

  • We stream a small volume of data (under 100 MB per second) to ADX, but ingestion goes idle after processing one batch.
  • Sometimes it does write, but the latency is in minutes. Neither behaviour seems to follow a pattern.

Configurations we enabled & tests we performed so far

  1. Defined the low-latency configuration on the ADX tables (at both the database and table level)
  2. Enabled the streaming ingestion policy (set it to true); a sketch of this command follows the list
  3. Increased the ADX cluster size, just to be sure
  4. Performed a stream write to object storage to confirm that streaming itself works; it does
  5. For comparison, the managed "data ingest" ingestion native to ADX ingests the same data from Event Hub with millisecond latency
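
For reference, step 2 boils down to a single management command. Below is a minimal sketch of running it with the azure-kusto-data Python SDK; the SDK usage and the assumption that kusto_cluster holds the full cluster URI are ours, not part of the original setup.

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Management client against the same cluster/app registration the connector uses
kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(
    kusto_cluster, KUSTO_AAD_APP_ID, KUSTO_AAD_APP_SECRET, KUSTO_AAD_AUTHORITY_ID
)
mgmt_client = KustoClient(kcsb)

# Step 2: enable the streaming ingestion policy on the destination table
mgmt_client.execute_mgmt(kusto_db, f".alter table {table} policy streamingingestion enable")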

Connector (Maven): com.microsoft.azure.kusto:kusto-spark_3.0_2.12:5.0.4

Writestream Code:

options = {
    "kustoCluster": f"{kusto_cluster}",
    "kustoDatabase": f"{kusto_db}",
    "kustoTable": f"{table}",
    "kustoAadAppId": f"{KUSTO_AAD_APP_ID}",
    "kustoAadAppSecret": f"{KUSTO_AAD_APP_SECRET}",
    "kustoAadAuthorityID": f"{KUSTO_AAD_AUTHORITY_ID}",
    "writeMode": "Queued",            # queued (batched) ingestion rather than streaming ingestion
    "clientBatchingLimit": "100"      # client-side batching limit used by the connector
}

kust_stream = (df
                .writeStream
                .queryName("ADX_WRITE")
                .format("com.microsoft.kusto.spark.datasink.KustoSinkProvider")
                .options(**options)
               )

kust_stream.start().awaitTermination()
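
For completeness, the same stream can also be started with an explicit checkpoint location and a short processing-time trigger so that Spark plans micro-batches frequently. This is only a sketch (the checkpoint path and the 1-second interval are placeholders, not values from our job), and on its own it does not remove the ingestion latency described above.

checkpoint_path = "/tmp/adx_write_checkpoint"   # placeholder; use a durable location in practice

kusto_query = (df
                .writeStream
                .queryName("ADX_WRITE")
                .format("com.microsoft.kusto.spark.datasink.KustoSinkProvider")
                .options(**options)
                .option("checkpointLocation", checkpoint_path)
                .trigger(processingTime="1 second")   # plan a micro-batch roughly every second
                .start())

kusto_query.awaitTermination()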

Expectation: low-latency (millisecond) writes when using the Spark-Kusto connector.

What could be the reason behind this problem?

https://github.com/Azure/azure-kusto-spark/pull/301 - This PR says the functionality is yet to be accepted. Does anyone know the timeline for the release, or is there a beta version to try the "Stream" write mode for ingestion?


Solution

  • We ran into the same issue. Below is the behaviour we noticed.

    Scenario:

    1. We wrote about 5k records into the destination table.
    2. The data was written in a couple of seconds. We verified this by looking at the ingestion_time() of the destination records.
    3. However, the Spark micro-batch was taking 5 minutes.

    Observations:

    The only plausible explanation was that, although the write completed in a couple of seconds, the connector was not returning control back to Spark and was timing out after 5 minutes. Only then would the next micro-batch start, and the cycle would repeat (a monitoring sketch follows these observations).

    Enabling streaming ingestion on the table didn't work for us either. The only other way to get Kusto to ingest sooner was to have it flush the data in smaller batches, which brings me to the policy that controls ingestion batching.
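
    One quick way to see this from the Spark side (our sketch, not part of the original observation) is to poll the streaming query's progress and compare the reported micro-batch duration with the ingestion_time() values in ADX:

    import time

    # start the writeStream from the question without awaitTermination so we can poll progress
    query = kust_stream.start()

    while query.isActive:
        progress = query.lastProgress             # dict describing the most recent micro-batch
        if progress:
            # durationMs showed roughly 300000 ms (5 minutes) per batch, even though the
            # records were already visible in ADX within seconds
            print(progress["batchId"], progress["durationMs"])
        time.sleep(30)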

    Solution:

    Kusto ingestion batching policy

    Updating the ingestion batching policy on the table solved it! This made sense: the 5-minute latency Spark saw per micro-batch matched the default maximum batching time span of the ingestion batching policy. Reducing it to under 10 seconds did the trick (that value worked for our use case; you need to arrive at the right number for yours).
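
    For illustration, here is a sketch of that policy change using the azure-kusto-data Python SDK and the variable names from the question; the 10-second time span is what worked for us, while the item-count and size limits are illustrative and should be tuned for your workload.

    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(
        kusto_cluster, KUSTO_AAD_APP_ID, KUSTO_AAD_APP_SECRET, KUSTO_AAD_AUTHORITY_ID
    )
    client = KustoClient(kcsb)

    # Flush queued ingestion after at most ~10 seconds instead of the 5-minute default
    batching_policy = '{"MaximumBatchingTimeSpan": "00:00:10", "MaximumNumberOfItems": 500, "MaximumRawDataSizeMB": 1024}'
    client.execute_mgmt(kusto_db, f".alter table {table} policy ingestionbatching '{batching_policy}'")

    The same policy can also be set at the database level (.alter database <db> policy ingestionbatching ...) if every table needs the lower latency.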

    Hope this helps!