azure-sql-database

Copy data incrementally between databases


I have two databases in Azure - Dev and DevInt (not Prod and Preprod).

In each of them there is a table Ingest (id int, name varchar(100)).

The Dev table is constantly loaded with new data. I want to copy this data once a day to DevInt.

I wanted to do it using this solution: https://learn.microsoft.com/en-us/azure/azure-sql/database/elastic-query-getting-started-vertical?view=azuresql

As both tables (in Dev and DevInt) have the same name, I created a view in Dev database that just selects all from original table. External table in DevInt points to this view in Dev.
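For reference, a minimal sketch of that setup (the credential, data source, server, and object names here are placeholders, not my real ones):

```sql
-- In DevInt: credential and external data source pointing at Dev.
CREATE DATABASE SCOPED CREDENTIAL DevCred
    WITH IDENTITY = 'someUser', SECRET = 'somePassword';

CREATE EXTERNAL DATA SOURCE DevSrc WITH (
    TYPE = RDBMS,
    LOCATION = 'myserver.database.windows.net',
    DATABASE_NAME = 'Dev',
    CREDENTIAL = DevCred
);

-- External table mapped to the view in Dev that selects from Ingest.
CREATE EXTERNAL TABLE IngestRemote (
    id INT,
    name VARCHAR(100)
) WITH (
    DATA_SOURCE = DevSrc,
    SCHEMA_NAME = 'dbo',
    OBJECT_NAME = 'IngestView'
);
```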

When I run a simple select * query without any conditions, it works fine, but whenever I add a condition like id > 100, I get a timeout. (Column id is the PK of the original table in Dev.)

Is there any way to follow this approach and get results in reasonable time, or is my idea generally wrong? If it's wrong, what can I use from the Azure stack to synchronize these tables?


Solution

  • Azure SQL databases are not designed for cross-database queries; elastic query is the feature that fills that gap, allowing you to access remote tables from another database. Although Azure SQL tries to push filter predicates (the WHERE clause) down to the remote table to take advantage of its indexes, you still have to make sure indexes on the remote table support your query, avoid varchar(max) and varbinary(max) columns, and avoid querying remote tables without a filter. Even so, there is significant overhead in the bridge between the databases, and the more data that has to cross that bridge to participate in JOIN operations, the worse the performance gets. To avoid this, consider not retrieving raw data from the remote database at all: instead, create a stored procedure on the remote database that returns only the result set you need. For example:

    DECLARE @cmd NVARCHAR(MAX);
    -- @subid_list is assumed to be declared and populated earlier.
    SELECT @cmd = N'EXEC sp_AnStoredProcedureName ' + @subid_list;
    
    EXEC sp_execute_remote @data_source_name = N'YourElasticDBQueryDataSrc', @stmt = @cmd;
    

    It is very important to know how to push parameters to the remote database when needed, especially when you query big tables. Read here to make sure you know how to do this. Filter as much data as you can on the remote side so that you retrieve as few rows as possible.
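    For instance, a sketch of pushing your id filter down with a parameterized remote statement, reusing the assumed data source name from the example above:

    ```sql
    -- Run the filtered query entirely on the remote (Dev) database,
    -- passing the cutoff as a parameter instead of filtering locally.
    -- 'YourElasticDBQueryDataSrc' is the assumed data source name from above.
    EXEC sp_execute_remote
        @data_source_name = N'YourElasticDBQueryDataSrc',
        @stmt = N'SELECT id, name FROM dbo.Ingest WHERE id > @minId;',
        @params = N'@minId INT',
        @minId = 100;
    ```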

    A different approach is to use a sync tool such as SQL Data Sync, which can sync tables between databases either manually or on a schedule with a frequency of seconds, minutes, hours, or days, so a once-a-day sync fits your scenario. SQL Data Sync keeps track of data changes since the last sync. If you try this option, start with an empty schema on the target (DevInt) database to speed up the initial sync.

    Finally, you can use Azure Data Factory to do incremental loads of data by using watermarks, change tracking, or time slicing. You can read more about this approach here.
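    As a rough illustration of the watermark pattern (the WatermarkTable name and shape are assumptions for this sketch, not something ADF requires):

    ```sql
    -- One-time setup: a table that remembers the last copied id.
    CREATE TABLE WatermarkTable (TableName sysname PRIMARY KEY, WatermarkValue INT);
    INSERT INTO WatermarkTable VALUES (N'Ingest', 0);

    -- Source query for each daily run: only rows added since the last run.
    SELECT i.id, i.name
    FROM dbo.Ingest AS i
    WHERE i.id > (SELECT WatermarkValue FROM WatermarkTable WHERE TableName = N'Ingest');

    -- After a successful copy, advance the watermark.
    UPDATE WatermarkTable
    SET WatermarkValue = (SELECT MAX(id) FROM dbo.Ingest)
    WHERE TableName = N'Ingest';
    ```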