Tags: azure, pyspark, azure-synapse

How to update rows in a DataFrame in PySpark when the table is in a serverless SQL pool in Azure Synapse


I am trying to update a few rows in a DataFrame using PySpark; the table exists in a serverless SQL pool.

I tried to use the UPDATE command, but the error says it is not supported in a serverless SQL pool. Is there any way to update a few rows in an Azure Synapse serverless SQL table?


Solution

  • If you are updating tables created over an ADLS data source in a serverless pool: per the documentation, serverless SQL pools don't support updating Delta Lake files. You can use a serverless SQL pool to query the latest version of Delta Lake, but you must use an Apache Spark pool in Synapse Analytics to update it (a Delta Lake sketch is included at the end of this answer). For CSV data, the same Spark-pool approach works, as the steps below show.

    Read the data from the path with the code below:

    df = spark.read.load('abfss://<containerName>@<ADLSName>.dfs.core.windows.net/mycsv2.csv', format='csv', header=True)
    


    Update the data according to your requirements with the code below:

    from pyspark.sql.functions import when, col
    df = df.withColumn('<colName>', when(<condition>, '<updateValue>').otherwise(df.<colName>))
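
    If several rows need different values, when clauses can be chained before the final otherwise. A small sketch with a hypothetical Status column:

    from pyspark.sql.functions import when, col
    # rows with Id 1 become 'Active', Id 2 become 'Inactive', everything else is unchanged
    df = df.withColumn('Status',
                       when(col('Id') == 1, 'Active')
                       .when(col('Id') == 2, 'Inactive')
                       .otherwise(col('Status')))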
    

    Create a linked service for ADLS, then write the DataFrame back to ADLS with the code below:

    import pandas as pd
    pdf = df.toPandas()
    pdf.to_csv('abfss://<filepath>/mycsv2.csv', storage_options={'linked_service': '<linkedserviceName>'})
    print("done")
    

    Complete code:

    from pyspark.sql.functions import when, col
    import pandas as pd

    # read the CSV from ADLS
    df = spark.read.load('abfss://rakeshsynapse@rakeshgen2.dfs.core.windows.net/mycsv2.csv', format='csv', header=True)
    # set Name to 'AE' on the row where Id is 1
    df = df.withColumn('Name', when(col('Id') == 1, 'AE').otherwise(df.Name))
    # write the updated data back through the linked service
    pdf = df.toPandas()
    pdf.to_csv('abfss://rakeshsynapse/mycsv2.csv', storage_options={'linked_service': 'AzureDataLakeStorage1'})
    print("done")
    


    The data will be updated successfully.
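
    As referenced above, if the underlying files are in Delta Lake format, a Spark pool can update rows in place instead of rewriting a CSV. A minimal sketch, assuming the delta package available on Synapse Spark pools and a hypothetical Delta folder path:

    from delta.tables import DeltaTable
    from pyspark.sql.functions import lit

    dt = DeltaTable.forPath(spark, 'abfss://<containerName>@<ADLSName>.dfs.core.windows.net/<deltaFolder>')
    # set Name to 'AE' on rows where Id is 1; serverless SQL can then query the new version
    dt.update(condition="Id = 1", set={"Name": lit('AE')})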