We are trying to achieve low-latency structured streaming ingestion into ADX (Azure Data Explorer) from Databricks, using PySpark writeStream with the open source Spark-Kusto connector.
Configurations we enabled and tests we performed so far:
Connector(Maven): com.microsoft.azure.kusto:kusto-spark_3.0_2.12:5.0.4
writeStream code:
options = {
    "kustoCluster": f"{kusto_cluster}",
    "kustoDatabase": f"{kusto_db}",
    "kustoTable": f"{table}",
    "kustoAadAppId": f"{KUSTO_AAD_APP_ID}",
    "kustoAadAppSecret": f"{KUSTO_AAD_APP_SECRET}",
    "kustoAadAuthorityID": f"{KUSTO_AAD_AUTHORITY_ID}",
    "writeMode": "Queued",
    "clientBatchingLimit": "100",
}

# Build the streaming write against the Kusto sink and block until termination.
kust_stream = (
    df.writeStream
      .queryName("ADX_WRITE")
      .format("com.microsoft.kusto.spark.datasink.KustoSinkProvider")
      .options(**options)
)
kust_stream.start().awaitTermination()
Expectation: low-latency writes (millisecond-level) through the Spark-Kusto connector. Instead, each microbatch takes minutes to complete. What could be the reason behind this latency?
https://github.com/Azure/azure-kusto-spark/pull/301 - this PR says the functionality is yet to be accepted. Does anyone know the timeline for the release, or is there a beta version to try the "Stream" write mode for ingestion?
We ran into the same issue. Below is the behaviour we noticed.
Scenario: the same Queued-mode structured streaming write to ADX as above.
Observations: the data landed in the Kusto table within a couple of seconds, yet each microbatch took a full 5 minutes to complete before the next one started, and the cycle repeated.
The only plausible explanation was that, although the write itself happened in a couple of seconds, the sink was not returning control back to Spark and was timing out after 5 minutes.
Enabling streaming ingestion on the table didn't work for us either. The only other way for Kusto to ingest with lower latency was to break the data into smaller batches, which brings us to the policy that governs ingestion batching.
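For context, toggling streaming ingestion on a table is a management command. A minimal sketch of issuing it from Python with the azure-kusto-data client, reusing the AAD app credentials from the question (the client setup here is illustrative, not part of the connector):

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Illustrative setup; kusto_cluster must be the full URI,
# e.g. https://<cluster>.<region>.kusto.windows.net
kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(
    kusto_cluster, KUSTO_AAD_APP_ID, KUSTO_AAD_APP_SECRET, KUSTO_AAD_AUTHORITY_ID
)
client = KustoClient(kcsb)

# Enable the streaming ingestion policy on the target table.
client.execute_mgmt(kusto_db, f".alter table {table} policy streamingingestion enable")

Note that streaming ingestion must also be enabled on the cluster itself for the table policy to take effect.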
Solution:
Kusto ingestion batching policy
Updating the ingestion batching policy on the table solved it! This made sense: the 5-minute latency per microbatch in Spark matched the default value in the ingestion batching policy. Reducing the batching time span to under 10 seconds did it (that was the number for our use case; you need to arrive at one for yours).
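For illustration, a minimal sketch of tightening the policy with the same azure-kusto-data client as in the sketch above (the 10-second time span and the other limits are assumptions; tune them for your workload):

import json

# Lower MaximumBatchingTimeSpan from the 5-minute default;
# the item and size limits here are illustrative values.
batching_policy = json.dumps({
    "MaximumBatchingTimeSpan": "00:00:10",
    "MaximumNumberOfItems": 500,
    "MaximumRawDataSizeMB": 1024,
})
client.execute_mgmt(kusto_db, f".alter table {table} policy ingestionbatching '{batching_policy}'")

The same policy can also be set at the database level if every table in the database needs the lower latency.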
Hope this helps!