Search code examples
sql-serverssissql-server-2016linked-server

Incremental load across two different servers with different source DB’s


I need to change our data loads from full load into incremental and with this change we are going to rebuild whole ETL process.

Here’s how data source infrastructure looks like now:

We are having two production servers (for two different products), let’s say P1 and P2. On P1 our data sources are two databases: DB1 and DB2 located on the linked server S1. On P2 there is only one database DB3 on linked server S1. In the future there will be added another database DB4 for P2 and product P3 added. There are SQL views on server S1 presenting all of the data.

And our ETL:

Two different SSIS projects for P1 and P2, which actually differs only in connection strings. DB1 and DB2 merged by Union ALL SSIS component directly in Data Flow tasks. Currently SSIS packages are executing SQL queries stored inside tasks, change in ETL for P1 is results in redoing same change in ETL for P2. Data is loaded twice a day on P1 and every 5 minutes on P2, with data load everything is truncated and loaded into staging tables in both data warehouses.

Goal:

Our goal is to create one universal ETL process with parametrization that allows us to use DB3 when SQL is executed on server P2 and DB1+DB2 when it’s executed on server P1, with possibility to extend this to P3+DB4 and DB3. We’d also like to move SQL code from packages into Stored Procedures, so maintaining would be easier from developer perspective.
We also need to make ETL happen more often on P1 but at the same time we are not allowed to query whole dataset multiple times in short period on linked server this will be a problem on P2 once the dataset grows larger with time.

Things we want to avoid: Dynamic SQL.

What are the best practices to create incremental data loads and such parametrization in SSIS? We are in constant contact with developer responsible for server S1 and if we need any kind of view, he will be able to deliver it.


Solution

  • The general pattern I would take would be something like this.

    My control flow will identify the databases on the server associated with our project (Connection Manager = Source)

    enter image description here

    Here I show a query against sys.databases because maybe you can apply a criteria like AND D.Name IN ('DB1', 'DB2', 'DB3');

    On S1, that query would return 2 values, on S2, only 1.

    We'd use that list of databases as the source of a ForEach Loop Enumerator to "shred" the results. For each value we identified in the original query (DB1, DB2) we're going to update the InitialCatalog property of our Source ConnectionManager. In the reference answers below, I set the ConnectionString property but you will only want to modify the InitialCatalog. So each loop around, the database pointed to will change.

    The Data Flow inside the ForEach enumerator is then simplified to only deal with the current database and not have to worry about whether this server has 3 source databases or 1.

    enter image description here

    Caveats

    The source query and data types must be compatible across all the associated databases. The structure of a Data Flow is set at design-time and cannot be changed during run-time.

    If the entities are consistent across the databases and it's just columns being called something different, create a view across each database to ensure the entity name is consistent and then you can avoid dynamic SQL.

    You will need to provide the initial value of the Source connection string when the package begins. This can be accomplished via the SET attribute on invocation.

    Reference answers

    Some relevant SSIS answers that explore these concepts