Tags: sql-server, postgresql, t-sql, data-warehouse, linked-server

Accessing data for ETL for a daily load in SQL Server


I need your expertise in figuring out the best possible option: store the data in the database before it is used by the DWH/ETL, or use it directly from the source and do the ETL there.

Our current situation is that we have a transactional system running on a PostgreSQL database. We have access to this database through a linked server in our SQL Server instance, where we have our data warehouse.

We would like to automate the process of uploading data from this linked server to the data warehouse. These are the options and challenges we have, and we would appreciate your thoughts on them.

We would like to have our ETL run daily, at night.

  1. Do we first store the data from the linked server in SQL Server tables before we write multiple queries with multiple joins on those tables to prepare the data for loading into the data warehouse?
  2. If we store the data from the linked server in SQL Server tables, I would prefer to do a truncate and load rather than an incremental load from the OLTP system into those tables (1 to 1) for all of them, since we are not able to get a differential load from the transactional system, people can go back and change data in it, and identifying which records are new or updated would be tricky at this point.

OR

Do we just use the source system through the linked server directly, writing the multiple joins to prepare the data on the fly and loading it into some sort of pre-staging table? (With this option, the issue we are currently facing is that when we write multiple joins directly against the linked server we get only 1 row back, regardless of the actual number of rows; but if we first copy each table used in the join into its own temporary table and run the query against those, we get the expected results, as in the sketch below.) Please let us know if someone has experienced this issue before and what the solution was.
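For reference, the workaround we are using today looks roughly like this; the linked server name [PGLINK], the database our_db, and the table and column names are only placeholders:

```sql
-- Joining the linked server tables directly returns only one row for us,
-- so we first copy each remote table into a local temp table...
SELECT * INTO #orders    FROM [PGLINK].[our_db].[public].[orders];
SELECT * INTO #customers FROM [PGLINK].[our_db].[public].[customers];

-- ...and then run the multi-join locally, which returns the expected row count.
SELECT o.order_id, o.order_date, c.customer_name
FROM #orders AS o
JOIN #customers AS c
  ON c.customer_id = o.customer_id;
```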


Solution

  • The traditional pattern for building a data warehouse is to follow your first option.

    1. Pull a replica of the data from the source system.
    2. Manipulate the data locally in staging tables.
    3. Deposit the optimized data in your warehouse layer for consumption.
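    As a rough illustration of those three steps in T-SQL, a nightly job might look like the sketch below; every object name in it ([PGLINK], replica.orders, staging.daily_sales, dw.fact_daily_sales) is an assumption, not something from your environment:

    ```sql
    -- Step 1: pull a replica of the source data (flush and fill).
    TRUNCATE TABLE replica.orders;
    INSERT INTO replica.orders (order_id, customer_id, order_date, amount)
    SELECT order_id, customer_id, order_date, amount
    FROM [PGLINK].[our_db].[public].[orders];

    -- Step 2: manipulate the data locally in a staging table.
    TRUNCATE TABLE staging.daily_sales;
    INSERT INTO staging.daily_sales (order_date, customer_id, total_amount)
    SELECT order_date, customer_id, SUM(amount)
    FROM replica.orders
    GROUP BY order_date, customer_id;

    -- Step 3: deposit the prepared rows in the warehouse layer for consumption.
    INSERT INTO dw.fact_daily_sales (order_date, customer_id, total_amount)
    SELECT order_date, customer_id, total_amount
    FROM staging.daily_sales;
    ```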

    There are several reasons to stick to the pattern. For instance, when you do manipulation at the same time as extraction, you're typically putting load on the source system, which is one of the things you were trying to get away from when you built the data warehouse to begin with. Also, virtually every ETL tool I've encountered handles data manipulation in memory on the ETL server, which is almost always less efficient than handling the manipulations with queries or stored procedures on the SQL Server machine. Sometimes devastatingly slower. Certainly there are more reasons, but those are the ones that have bitten me when I've varied from the standard practice in the past.

    If I were you, I'd test the performance of building out the replica tables by querying your linked server tables versus pulling them directly from server to server with SSIS. For a flat data move like that, SSIS can be very efficient, whereas for bulk data moves linked servers can be less so.
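    If you do benchmark the linked-server route, it is worth timing both query styles, because they behave differently: with four-part names SQL Server decides how much of the work is sent to the remote side, while OPENQUERY ships the inner statement to PostgreSQL as-is and only streams the finished result set back. The object names below are placeholders.

    ```sql
    -- Variant A: four-part naming. SQL Server plans the remote access itself;
    -- once joins or filters are involved, more data than expected may come across.
    INSERT INTO replica.orders
    SELECT order_id, customer_id, order_date, amount
    FROM [PGLINK].[our_db].[public].[orders];

    -- Variant B: OPENQUERY. The inner query runs entirely on PostgreSQL and only
    -- the result set crosses the linked server, which is often better for bulk pulls.
    INSERT INTO replica.orders
    SELECT order_id, customer_id, order_date, amount
    FROM OPENQUERY([PGLINK],
        'SELECT order_id, customer_id, order_date, amount FROM public.orders');
    ```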

    That said, you don't need to give up on incremental loads just because your source system doesn't have any change data tracking available.

    You can flush and fill your replica tables every day, but when you do, you have the opportunity to introduce your own change tracking. The most common pattern for that is to introduce row hashing on import, together with tables that keep track of which records you've already loaded. Getting into the details of all that is well outside the scope of this question, but there is more information about it in these two questions, and a rough sketch of the idea follows the links. Enough to fuel your own Googling, anyway.

    Identifying Differences Efficiently

    SSIS only extract Delta changes
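    To make that a bit more concrete, here is a minimal sketch of the row-hashing idea, assuming a replica.orders table that is flushed and filled nightly and a persistent tracking table. All names are illustrative, and CONCAT_WS requires SQL Server 2017 or later (on older versions, concatenate the columns manually):

    ```sql
    -- Tracking table: one row per business key, holding the hash of the version
    -- that was last loaded into the warehouse.
    CREATE TABLE etl.orders_loaded
    (
        order_id INT           NOT NULL PRIMARY KEY,
        row_hash VARBINARY(32) NOT NULL
    );

    -- Hash each incoming row and compare it with what was loaded before:
    -- no match on the key -> new row; same key but different hash -> changed row.
    WITH incoming AS
    (
        SELECT order_id,
               HASHBYTES('SHA2_256',
                         CONCAT_WS('|', customer_id, order_date, amount)) AS row_hash
        FROM replica.orders
    )
    SELECT i.order_id,
           CASE WHEN t.order_id IS NULL THEN 'new' ELSE 'changed' END AS change_type
    FROM incoming AS i
    LEFT JOIN etl.orders_loaded AS t
           ON t.order_id = i.order_id
    WHERE t.order_id IS NULL          -- new rows
       OR t.row_hash <> i.row_hash;   -- updated rows
    ```

    After the new and changed rows have been pushed into the warehouse, you would merge the fresh hashes back into etl.orders_loaded so the next night's run has something current to compare against.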