Tags: scala, apache-spark, pyspark, azure-synapse

PySpark Equivalent of synapsesql in ASA


When working with a Spark notebook for Azure Synapse Analytics (ASA), I can use Scala to save a CSV file as a table in a Dedicated SQL Pool with two simple statements:

%%spark
// [1] - https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
// [2] - https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri
val testDF = spark.read.format("csv").option("header", "true").load(
    "abfss://[email protected]/TIOBE-azure-backup.csv"
);
// [3] - https://www.aizoo.info/post/dropping-a-sql-table-in-your-synapse-spark-notebooks
// [4] - https://stackoverflow.com/questions/67907984/write-data-to-sql-dw-from-apache-spark-in-azure-synapse
testDF.write.mode("overwrite").synapsesql("eary_dedicated_test_sql_pool.dbo.TIOBE_test");

Unfortunately, [3] from above seems to imply the synapsesql function does not exist for PySpark. Has this situation changed since Darren last updated his post?

NOTE: I do not want to mess with configuring the Azure Synapse Dedicated SQL Pool Connector for Apache Spark. My dedicated SQL pool is in the same workspace as my Spark pool, so I don't feel like I should have to hassle with configuring a bunch of authentication options:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export?tabs=scala%2Cscala1%2Cscala2%2Cscala3%2Cscala4%2Cscala5

EDIT: The following PySpark code gives me
"AttributeError: 'DataFrameWriter' object has no attribute 'synapsesql'"
on line 7:

%%pyspark
df = spark.read.load('abfss://[email protected]/TIOBE-azure-backup.csv', format='csv'
## If header exists uncomment line below
, header=True
)

# [5] - https://stackoverflow.com/questions/69720753/write-dataframe-to-sql-dedicated-database-using-synapse-analytics
df.write.mode("overwrite").synapsesql("eary_dedicated_test_sql_pool.dbo.TIOBE_test")

Solution

  • Python support for synapsesql has been available for about a year now. Just add the required imports as per the docs:

    # Add required imports
    import com.microsoft.spark.sqlanalytics
    from com.microsoft.spark.sqlanalytics.Constants import Constants
    from pyspark.sql.functions import col
    
    # Get the table with synapsesql method and expose as temp view
    df = spark.read.synapsesql("sandpit_ded.dbo.nation")
    
    df.createOrReplaceTempView("vw_nation")
    

    (Screenshot: Python Notebook using the synapsesql method)
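
    For completeness, writing back to the dedicated SQL pool should then work just like the Scala snippet in the question once those imports are in place. Below is a minimal sketch reusing the pool, schema, and table names from the question; the abfss:// path is a placeholder for your own container and storage account:

    %%pyspark
    # Required imports for the Synapse Dedicated SQL Pool connector
    import com.microsoft.spark.sqlanalytics
    # Constants is only needed if you pass an explicit table type
    # (e.g. Constants.INTERNAL / Constants.EXTERNAL) to synapsesql
    from com.microsoft.spark.sqlanalytics.Constants import Constants

    # Load the CSV from ADLS Gen2 (placeholder path)
    df = spark.read.load(
        'abfss://<container>@<storage-account>.dfs.core.windows.net/TIOBE-azure-backup.csv',
        format='csv',
        header=True
    )

    # Write to the dedicated SQL pool; three-part name <database>.<schema>.<table>
    df.write.mode("overwrite").synapsesql("eary_dedicated_test_sql_pool.dbo.TIOBE_test")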