I am designing an incremental update process for a cloud-based database (Azure). The only existing changelog is a .txt file that records every insert, delete, and update statement that the database processes. There is no change data capture table available, no database table that records changes, and I cannot enable watermarking on the database. The .txt file is structured as follows:
update [table] set x = 'data' where y = 'data'
go
insert into [table] values (data)
go
delete from [table] where x = data
go
I have built a process that converts the .txt file into a table in the cloud, structured as follows:
update_id | db_operation | statement | user | processed_flag
----------|--------------|-------------------------------------------------|-------|---------------
1 | 'update' | 'update [table] set x = data where y = data' | user1 | 0
2 | 'insert' | 'insert into [table] values (data)' | user2 | 0
3 | 'delete' | 'delete from [table] where x = data' | user3 | 1
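For reference, the log table is defined roughly like this (the column types shown here are approximate):

CREATE TABLE upd.incremental_updates
(
    update_id      BIGINT         NOT NULL,
    db_operation   VARCHAR(10)    NOT NULL,  -- 'insert', 'update', or 'delete'
    [statement]    NVARCHAR(4000) NOT NULL,  -- the raw SQL text from the .txt log
    [user]         VARCHAR(128)   NOT NULL,
    processed_flag BIT            NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);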
I use the following code to create a temporary table of the unprocessed transactions, then loop over that table, build each SQL statement, and execute it:
CREATE TABLE temp_incremental_updates
WITH
(
DISTRIBUTION = HASH ( [user] ),
HEAP
)
AS
SELECT ROW_NUMBER() OVER(ORDER BY update_id) AS Sequence, -- preserve the original statement order
[user],
[statement]
FROM upd.incremental_updates
WHERE processed_flag = 0;
DECLARE @nbr_statements INT = (SELECT COUNT(*) FROM temp_incremental_updates),
@i INT = 1;
WHILE @i <= @nbr_statements
BEGIN
DECLARE @sql_code NVARCHAR(4000) = (SELECT [statement] FROM temp_incremental_updates WHERE Sequence = @i);
EXEC sp_executesql @sql_code;
SET @i +=1;
END
DROP TABLE temp_incremental_updates;
UPDATE upd.incremental_updates SET processed_flag = 1;
This is taking a very long time, upwards of an hour. Is there a different way I can quickly process multiple SQL statements that need to occur in a specific order? Order matters because, for example, if I try to process a delete statement before the insert statement that created that data, Azure Synapse will throw an error.
Less than 2 hours for 20k individual statements is pretty good for Synapse!
Synapse isn't meant to do transactional processing. You need to convert individual updates into batch updates and execute statements like MERGE for big batches of rows, instead of an INSERT, UPDATE, or DELETE for each row.
In your situation, you could, for each target table, collect the rows to be inserted or updated into a staging table such as table1_insert_updates, then run a MERGE-like statement from table1_insert_updates into table1 (see the sketch below).
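A minimal sketch of that upsert, assuming table1 has a key column y and a data column x, and that table1_insert_updates has already been loaded with one row per key (all table and column names here are placeholders):

-- One set-based upsert replaces thousands of single-row INSERT/UPDATE statements
MERGE INTO table1 AS tgt
USING table1_insert_updates AS src
    ON tgt.y = src.y
WHEN MATCHED THEN
    UPDATE SET tgt.x = src.x
WHEN NOT MATCHED THEN
    INSERT (y, x)
    VALUES (src.y, src.x);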
For deletes, collect the affected primary keys and run a single DELETE FROM table1 where key in (primary keys) per table (see the sketch below).
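A sketch of the batched delete, assuming the keys collected from the log have been loaded into a staging table table1_deletes (again, the names are placeholders):

-- One set-based delete per table instead of one DELETE statement per row
DELETE FROM table1
WHERE y IN (SELECT y FROM table1_deletes);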
Frankly, 20k is an awkward number: it's not too small, yet far from big enough. So even after "grouping" you could still have performance issues if your batch/group sizes are too small.
Synapse isn't meant for transaction processing. It'll merge a table with a million rows into a table with a billion rows in less than 5 minutes using a single MERGE statement to upsert a million rows, but if you run 1000 delete and 1000 insert statements one after the other it'll probably take longer!
EDIT: You'll also have to use PARTITION BY and RANK (or ROW_NUMBER) to de-duplicate in case there are multiple updates to the same row in a single batch. Depending on what your input looks like (whether an update contains all columns, even unchanged ones, or only the changed columns), this can become very complicated.
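For example, a minimal de-duplication sketch that keeps only the latest change per key before merging, assuming table1_insert_updates also carries the original update_id so the newest change per key can be identified, and y is the key (names are placeholders):

-- Keep only the most recent change per key within the batch
CREATE TABLE table1_insert_updates_dedup
WITH
(
    DISTRIBUTION = HASH ( y ),
    HEAP
)
AS
SELECT y, x
FROM
(
    SELECT y, x,
           ROW_NUMBER() OVER (PARTITION BY y ORDER BY update_id DESC) AS rn
    FROM table1_insert_updates
) t
WHERE rn = 1;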
Again, Synapse is not meant for transaction processing.