Tags: apache-spark, pyspark, azure-sql-database, databricks

Writing a Spark DataFrame to Azure SQL Server with a clustered columnstore index and PK/FK


Consider the following use case: I want to write a Spark data frame to a data warehouse (DWH) stored in an Azure SQL Server database using Microsoft's Apache Spark SQL Connector, with either "overwrite" or "append" mode depending on the use case.

For DWHs, clustered columnstore indexes are recommended. Since I also want primary key constraints to be enforced, nonclustered indexes on a clustered columnstore index would probably yield the best performance.

How do I create or append to tables with pyspark using this index design?

# mode can be "overwrite" or "append", depending on the use case
df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", sqldbuser) \
        .option("password", sqldbpwd) \
        .save()

The answer does not have to be purely PySpark-based (e.g., it can use SQLAlchemy); it just has to work robustly.


Solution

  • Your best approach would be to insert the data frame from PySpark into a staging table on Azure SQL Server (with overwrite), then use a stored procedure to insert the data into your columnstore-indexed DWH table (see the Python sketch after the SQL snippet below).

    In the stored procedure, use something like this:

        insert into dwh_table
        select *
        from stg_table
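
    A minimal sketch of this two-step pattern is shown below. The names dbo.stg_table and dbo.usp_load_dwh_table are illustrative assumptions (not from the original question), the server/database/credential variables are assumed to be defined already, and pyodbc plus an ODBC driver are assumed to be available on the driver node:

        import pyodbc

        # 1) Overwrite the staging table from the Spark data frame.
        #    Table and procedure names here are illustrative assumptions.
        df.write \
            .format("com.microsoft.sqlserver.jdbc.spark") \
            .mode("overwrite") \
            .option("url", url) \
            .option("dbtable", "dbo.stg_table") \
            .option("user", sqldbuser) \
            .option("password", sqldbpwd) \
            .save()

        # 2) Call the stored procedure that moves the staged rows into the
        #    columnstore DWH table. Driver name and server/database variables
        #    are assumptions; adjust to your environment.
        conn = pyodbc.connect(
            "DRIVER={ODBC Driver 18 for SQL Server};"
            f"SERVER={server};DATABASE={database};UID={sqldbuser};PWD={sqldbpwd}"
        )
        cursor = conn.cursor()
        cursor.execute("EXEC dbo.usp_load_dwh_table;")
        conn.commit()
        cursor.close()
        conn.close()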
    

    Also, try inserting rows in large batches (ideally enough to fill complete rowgroups of 1,048,576 rows) so that the columnstore index can compress them efficiently; a sketch of the relevant writer options follows.
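
    If you append directly to the columnstore table with the connector instead of staging first, the batch-related writer options are the main knob. This is only a sketch: "batchsize" is the standard Spark JDBC batch-size option, while "tableLock" is a connector-specific bulk-load option — check that the connector version you use supports it before relying on it.

        # Direct append to the columnstore DWH table (alternative to the staging route).
        # dbo.dwh_table is an illustrative name; batchsize controls how many rows
        # are sent per bulk-copy batch.
        df.write \
            .format("com.microsoft.sqlserver.jdbc.spark") \
            .mode("append") \
            .option("url", url) \
            .option("dbtable", "dbo.dwh_table") \
            .option("user", sqldbuser) \
            .option("password", sqldbpwd) \
            .option("batchsize", 1048576) \
            .option("tableLock", "true") \
            .save()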

    About the nonclustered index: if you need it as a primary key to prevent duplicates, then you have no choice but to keep it active during the load.

    If not, try disabling the nonclustered indexes before the load (in your stored procedure) and rebuilding them after the load, as sketched below. Note that this does not always give the best performance; you'll have to test it with your workload.
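
    A sketch of the disable/load/rebuild sequence driven from Python is below; the same ALTER INDEX statements could equally live inside the stored procedure. IX_dwh_table_col, dbo.dwh_table, and dbo.usp_load_dwh_table are illustrative assumptions, and the connection string matches the earlier sketch:

        import pyodbc

        # Hypothetical names: IX_dwh_table_col is an illustrative nonclustered
        # index on dbo.dwh_table; connection variables are assumptions.
        conn = pyodbc.connect(
            "DRIVER={ODBC Driver 18 for SQL Server};"
            f"SERVER={server};DATABASE={database};UID={sqldbuser};PWD={sqldbpwd}"
        )
        cursor = conn.cursor()

        # Disable the nonclustered index before the load ...
        cursor.execute("ALTER INDEX IX_dwh_table_col ON dbo.dwh_table DISABLE;")
        conn.commit()

        # ... run the load (e.g. the staging-to-DWH stored procedure) ...
        cursor.execute("EXEC dbo.usp_load_dwh_table;")
        conn.commit()

        # ... then rebuild it (a disabled index must be rebuilt to become usable again).
        cursor.execute("ALTER INDEX IX_dwh_table_col ON dbo.dwh_table REBUILD;")
        conn.commit()

        cursor.close()
        conn.close()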

    Please refer to this document: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-data-loading-guidance?view=sql-server-ver16