Tags: sql, sql-server, database, transactions, azure-synapse

Azure Synapse: fastest way to process 20k statements in order


I am designing an incremental update process for a cloud-based database (Azure). The only existing changelog is a .txt file that records every insert, delete, and update statement that the database processes. There is no change data capture table available, nor any database table that records changes, and I cannot enable watermarking on the database. The .txt file is structured as follows:

update [table] set x = 'data' where y = 'data'
go
insert into [table] values (data)
go
delete from [table] where x = data
go

I have built my process to convert the .txt file into a table in the cloud as follows:

update_id | db_operation | statement                                       | user  | processed_flag
----------|--------------|-------------------------------------------------|-------|---------------
1         | 'update'     | 'update [table] set x = data where y = data'    | user1 | 0
2         | 'insert'     | 'insert into [table] values (data)'             | user2 | 0
3         | 'delete'     | 'delete from [table] where x = data'            | user3 | 1

I use this code to create a temporary table of the unprocessed transactions, then loop over that table, building and executing each SQL statement in turn:

-- Stage the unprocessed statements; ordering by update_id preserves
-- the original changelog order
CREATE TABLE temp_incremental_updates
WITH
( 
    DISTRIBUTION = HASH ( [user] ),
    HEAP
)
AS
SELECT  ROW_NUMBER() OVER(ORDER BY update_id) AS Sequence, 
        [user],
        [statement]
FROM    upd.incremental_updates
WHERE   processed_flag = 0;

DECLARE @nbr_statements INT = (SELECT COUNT(*) FROM temp_incremental_updates),
        @i INT = 1,
        @sql_code NVARCHAR(4000);

-- Replay each captured statement one at a time, in changelog order
WHILE   @i <= @nbr_statements
BEGIN
    SET     @sql_code = (SELECT [statement] FROM temp_incremental_updates WHERE Sequence = @i);
    EXEC    sp_executesql @sql_code;
    SET     @i += 1;
END

DROP TABLE temp_incremental_updates;

UPDATE upd.incremental_updates SET processed_flag = 1 WHERE processed_flag = 0;

This is taking a very long time, upwards of an hour. Is there a different way I can quickly process multiple SQL statements that need to occur in a specific order? Order is relevant because, for example, if I try to process a delete statement before the insert statement that created that data, Azure Synapse will throw an error.


Solution

  • Less than 2 hours for 20k individual statements is pretty good for Synapse!

    Synapse isn't meant to do transactional processing. You need to convert the individual updates to batch updates and execute statements like MERGE for big batches of rows, instead of INSERT, UPDATE and DELETE for each row.

    In your situation, you could:

    • Group all inserts/updates by table name
    • Create a temp table for each group, e.g. table1_insert_updates
    • Run a MERGE-like statement from table1_insert_updates into table1, as sketched below.
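
    For example, once the insert/update values for a given table have been parsed out of the statements into a staging table, one MERGE can upsert the whole batch. A minimal sketch, assuming a hypothetical target table1 keyed on y with a payload column x, and a staging table table1_insert_updates of the same shape (all names here are illustrative):

    -- Upsert the entire staged batch in a single statement.
    -- table1, table1_insert_updates and columns x/y are hypothetical names.
    MERGE INTO table1 AS tgt
    USING table1_insert_updates AS src
        ON tgt.y = src.y
    WHEN MATCHED THEN
        UPDATE SET tgt.x = src.x
    WHEN NOT MATCHED THEN
        INSERT (x, y) VALUES (src.x, src.y);

    Note that Synapse dedicated SQL pools have historically required the MERGE target to be hash-distributed, so check the current limitations in the docs.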

    For deletes:

    • Group the primary keys by table name
    • Run one DELETE FROM table1 WHERE key IN (primary keys) per table, as sketched below.
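
    A sketch of the per-table delete, again with hypothetical names (table1_deletes holding the parsed-out primary keys):

    -- Delete the whole staged batch of keys in one statement.
    -- table1_deletes and the key column y are hypothetical names.
    DELETE FROM table1
    WHERE y IN (SELECT y FROM table1_deletes);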

    Frankly, 20k is an awkward number: it's too large to loop through row by row, yet far from big enough to play to Synapse's strengths. So even after grouping, you could still have performance issues if your batch/group sizes are too small.

    Synapse isn't meant for transaction processing. It'll merge a table with a million rows into a table with a billion rows in less than 5 minutes using a single MERGE statement, but if you run 1,000 delete and 1,000 insert statements one after the other, it'll probably take longer!


    EDIT: You'll also have to use PARTITION BY with RANK (or ROW_NUMBER) to de-duplicate in case there are multiple updates to the same row in a single batch. Depending on what your input looks like (whether an update carries all columns, even unchanged ones, or only the changed columns), this can become very complicated; a sketch follows.
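
    As one sketch of that de-duplication, assuming updates carry all columns: keep only the latest staged change per key before merging, using the original update_id as the ordering (table1_changes is a hypothetical table of parsed inserts/updates for one target table):

    -- Keep only the most recent staged change per key; update_id
    -- reflects the original changelog order.
    -- table1_changes and columns x/y are hypothetical names.
    CREATE TABLE table1_insert_updates
    WITH ( DISTRIBUTION = HASH ( [y] ), HEAP )
    AS
    SELECT  y, x
    FROM    (
                SELECT  y, x,
                        ROW_NUMBER() OVER (PARTITION BY y ORDER BY update_id DESC) AS rn
                FROM    table1_changes
            ) d
    WHERE   rn = 1;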

    Again: Synapse is not meant for transaction processing.