Tags: azure, pyspark, azure-sql-database, databricks, azure-databricks

Databricks write command to Azure SQL stuck - Very large table


We have an extremely large table (> 500 columns, millions of rows) and we're trying to write data from the data lake into Azure SQL. The job runs forever: it neither completes nor throws any error. Below is my code; the same code has worked on many other occasions. Any help is appreciated, thank you in advance.

# Write DataFrame to SQL database
df.write \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "your_table_name") \
    .option("user", connection_properties["user"]) \
    .option("password", connection_properties["password"]) \
    .option("driver", connection_properties["driver"]) \
    .mode("append") \
    .save()  


# Get the number of partitions
num_partitions = df.rdd.getNumPartitions()
print("Number of partitions:", num_partitions) // Number of partitions: 600

Solution

  • Here is how you can do it.

    Code:

    # Write df to Azure SQL over JDBC with an explicit batch size and capped write parallelism
    df.write \
    .format("jdbc") \
    .option("driver", jdbc_properties["driver"]) \
    .option("url", jdbc_properties["url"]) \
    .option("dbtable", "L_Table") \
    .option("user", jdbc_properties["user"]) \
    .option("password", jdbc_properties["password"]) \
    .option("truncate", "true") \
    .option("batchsize", 10000) \
    .option("isolationLevel", "NONE") \
    .option("numPartitions", 10) \
    .mode("overwrite") \
    .save()
    
    • In the above code, .option("batchsize", 10000) specifies the batch size for the write: it controls how many rows are sent to the database in each batch, reducing the number of round trips.

    • .option("isolationLevel", "NONE") Sets the isolation level for the database connection. In this case, it is set to "NONE" which means no specific isolation level is enforced.

    • .option("numPartitions", 10) Specifies the number of partitions to be used for parallelism when writing the data. It helps improve performance by processing data in parallel across multiple tasks.

    • .mode("overwrite") Sets the write mode to "overwrite," which means it will overwrite the existing data in the table if it already exists.

    This code uses the JDBC format in PySpark to establish a connection with the Azure SQL database and write the DataFrame (df) to the specified table (L_Table) with the provided options and configurations.
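
    The code above assumes a jdbc_properties dictionary holding the driver class, URL, and credentials. As a minimal sketch (the server, database, and credential values below are placeholders, not taken from the original post), it could look like this for Azure SQL with the Microsoft SQL Server JDBC driver:

    # Hypothetical connection settings -- replace the placeholders with your own values
    jdbc_properties = {
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "url": "jdbc:sqlserver://<your-server>.database.windows.net:1433;databaseName=<your-database>;encrypt=true",
        "user": "<sql-user>",
        "password": "<sql-password>",
    }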

    For example, I created a SQL table with 100 columns and wrote the rows from the data lake into Azure SQL; the resulting L_Table contains the data for all 100 columns (screenshots of the table and the loaded data omitted).

    Get the partition count (screenshot omitted):
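
    As a small check (a sketch, assuming df is the same DataFrame being written), you can confirm how many partitions will actually hit the database. Since the numPartitions option makes Spark coalesce the DataFrame to at most that many partitions before writing, an explicit coalesce gives the same effect:

    # Reduce the 600 input partitions to 10 so that at most 10 concurrent
    # JDBC connections are opened against Azure SQL during the write.
    df_limited = df.coalesce(10)
    print("Partitions used for the write:", df_limited.rdd.getNumPartitions())  # 10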