Search code examples
serverlessazure-synapseexternal-tablesincremental

Azure Synapse > Serverless SQL > UPSERT parquet files is possible?


I'm using an Azure Synapse Serverless SQL database.

I have a first copy activity that uses Azure SQL tables as sources and an Azure storage gen2, where I store .parquet files as sink. From these .parquet files, I use CETAS to create External tables in my Serverless SQL DB (for the context: I'm working with several Azure SQL databases, so these external tables will allow me to write cross db queries). In other words, this Serverless SQL DB is my ODS database.

I have then a second copy activity that identifies incremental changes in the source tables (using the SYS_CHANGE_VERSION of the corresponding CHANGETABLE of the source DB). This second copy activity also outputs .parquet files.

In the end, I have 2 parquet files: 1 with the full content of the source table + 1 with the content to insert or update. External tables in a Serverless SQL DB are only metadata, so no way to do DML operations on them, so my question would be: is there a way to "merge" my 2 parquet files into 1 single one (without duplicates of course) that I could use to recreate an updated external table?

Alternatively, I see that I can choose in the copy activity sink the copy method "Upsert" and provide a KeyColumn (PK of my table), but it doesn't work, saying that: "Message=INSERT operation is not allowed for this type of table." (which looks normal as the associated sink dataset is pointing on my external table, that is read only)

copy activity: copy activity

Any idea on how to solve this? thanks!


Solution

  • Since external tables in Azure Synapse Serverless SQL database are read-only, you cannot use the Upsert copy method to update the external table directly.

    • If the full load file and incremental files are in the same folder in ADLS, then while creating the external tables you can give ** in place of filename.

    Sample external table script:

    CREATE  EXTERNAL  TABLE [dbo].[external_table] (
    [PK] nvarchar(4000),
    [name] nvarchar(4000),
    [ingestion_time] nvarchar(4000)
    )
    WITH (
    LOCATION = '<folder-name>/**',
    DATA_SOURCE = <datasource-name>,
    FILE_FORMAT = <fileformatname>
    )
    

    This script will make sure to combine all the data under that folder.

    • When the old data loaded in the initial version got replaced with new data, It is necessary to take only the latest record.

    Example, File1 data:

    PK,name,ingestion_time
    1,Karikala,2023-05-01
    4,Kalyani,2023-05-01
    7,Sindhu,2023-05-01
    

    File2 data:

    PK,name,ingestion_time
    1,Aadhi,2023-06-01
    

    Here, File2 data has the updated record for PK=1. When the query select * from external_table is executed, all four records will be displayed. Thus, Create a view on top of this external table to display only the latest record for each primary key.

    Sample Query:

    with cte as(
    SELECT *,RANK() over (partition  by PK order  by [ingestion_time] desc) as  Rank  FROM [dbo].[external_table])
    select PK,name,[ingestion_time] from cte where  rank=1
    
    PK name ingestion_time
    1 Aadhi 2023-06-01
    2 Kalyani 2023-05-01
    3 Sindhu 2023-05-01

    This query returns the rows with the highest ingestion_time value for each PK value in the external_table.