Search code examples
ssisreal-timescheduling

How to ensure consistency with multiple SSIS-packages filling the same tables?


I have a SSIS package with 2 data flow tasks. The first data flow task is filling values into a dimension table. The second data flow task is filling surrogate keys into the fact table. The fact table is referencing the previous filled dimension table via surrogate key. However, a further SSIS package is doing exactly the same, but with data from another data source. Both SSIS packages are fired by SQLServer Agent in low frequency (each 20 - 40 seconds). I am worrying about consistency. If I had a single SSIS package that loads the data into the dimension table and fact table, I wouln't have to because it would be possible to create the control flow to enforce the following sequence:

  1. Fill the Dimension table with data from data source 1
  2. Fill the Fact table with data from data source 1 (correct surrogate key to Dim)
  3. Fill the Dimension table with data from data source 2
  4. Fill the Fact table with data from data source 2 (correct surrogate key to Dim)

So in this case the primary key of the Dimension table as well as the corresponding surrogate key in the fact table could be auto-incremented simply in SQL Server DB and everything would be fine. But since I have 2 SSIS packages, each running independently on a multi-core ETL server in low frequency, I am worrying about the case when the following will happen:

  1. Both packages are starting approximately at the same time
  2. Fill the Dimension table with data from data source 1
  3. Fill the Dimension table with data from data source 2
  4. Fill the Fact table with data from data source 2 (surrogate key to wrong Dim record)
  5. Fill the Fact table with data from data source 1 (surrogate key to wrong Dim record)

Are there any common best practises or, on the other hand, is such a handling necessary or does SQL Server handle such situation by default e.g. by forbid packages to be processed in parallel? Maybe a Write Lock on both tables during the start of each SSIS-package could be satisfactory but in this case I am worrying that this could result in a failure thrown by the other SSIS-package if it cannot reach the destination tables. I am new to SSIS and I would like to know my options about any good techniques to avoid this situation (if necessary).


Solution

  • One option is to use transactions in SSIS. You can embed in the transaction the critical part of the ETLs.

    But I'm not sure to understand what makes you think there could be a problem. If you use an identity column on your dimension table, there can not be duplicates, no matter how many threads insert at the same time. In your step 4 and 5, how could you get a surrogate to a wrong record ? Please illustrate your question with an example of how you plan to match your fact with your Dim record.