Tags: apache-spark, azure-sql-database, azure-databricks, azure-elasticpool

Is the performance of inserting data into an Azure SQL database from Databricks affected by the sizing of the database?


I am working on a use case that needs to ingest a large amount of data (~10M rows) from an Azure Databricks materialized view into an Azure SQL database. The database uses the Standard elastic pool (50 eDTU) pricing tier. I have already implemented various optimization measures on the Databricks side, but the Spark job is still not getting anywhere at all. This makes me wonder whether the bottleneck is actually the database rather than the Spark configuration.

try:
    # Write via the Apache Spark connector for SQL Server and Azure SQL
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", sql_db_url) \
        .option("dbtable", target_table) \
        .option("user", username) \
        .option("password", password) \
        .option("batchsize", "100000") \
        .option("tableLock", "true") \
        .option("schemaCheckEnabled", "false") \
        .option("reliabilityLevel", "BEST_EFFORT") \
        .save()

    print("Successfully wrote data into the target SQL database")
except Exception as error:
    print("An exception occurred:", error)

(Screenshot: whenever the insert statement runs from Databricks, the CPU utilization of the database hits 100%)

Appreciate any advice.

I have tried various optimization measures in Databricks and also different cluster sizes; the kind of Spark-side tuning tried looks roughly like the sketch below.
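The sketch is illustrative only: the partition count and batch size are placeholder values, not settings I am claiming are correct. The idea is that each Spark partition writes over its own JDBC connection, so fewer partitions means fewer concurrent sessions competing for the small database.

# Illustrative only: fewer write partitions -> fewer concurrent JDBC
# connections hitting the 50 eDTU database at the same time.
df_small = df.repartition(4)   # placeholder partition count, not a tested value

(df_small.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("overwrite")
    .option("url", sql_db_url)
    .option("dbtable", target_table)
    .option("user", username)
    .option("password", password)
    .option("batchsize", "10000")   # smaller batch size, placeholder value
    .option("tableLock", "true")
    .save())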


Solution

  • Like Anupam said, your bottleneck is very likely the 50 eDTU tier for such a workload. The graph you have there shows only DTU, and it is spiking to 100%; to get a better understanding, replace the DTU metric with CPU, Data IO and Log IO (DTU is essentially a blend of these). Log IO is likely your bottleneck, since you are bulk-loading data into the database. If you also see high CPU and Workers, then you could also be missing indexes.
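If it helps, here is a rough sketch of how those component metrics can be pulled straight from the database while the load is running, instead of relying on the portal chart. It assumes pyodbc and a Microsoft ODBC driver are available where you run it, and the server, database and credential values are placeholders; sys.dm_db_resource_stats is the Azure SQL Database DMV that exposes the CPU, Data IO and Log IO percentages the DTU figure is derived from.

import pyodbc

# Placeholder connection details -- substitute your own server, database
# and credentials (the same ones used by the Spark write).
conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-server>.database.windows.net;"
    "DATABASE=<your-database>;"
    "UID=<username>;PWD=<password>"
)

query = """
SELECT TOP (20)
       end_time,
       avg_cpu_percent,        -- CPU component of the DTU
       avg_data_io_percent,    -- Data IO component
       avg_log_write_percent   -- Log IO component (likely the bottleneck here)
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
"""

with pyodbc.connect(conn_str) as conn:
    for row in conn.execute(query):
        print(row.end_time, row.avg_cpu_percent,
              row.avg_data_io_percent, row.avg_log_write_percent)

If avg_log_write_percent is pinned near 100% during the insert while CPU stays low, that points at the Log IO ceiling of the 50 eDTU pool rather than anything on the Spark side.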