I need your expertise in choosing the best option: store the data in a database before it is used in the DWH/ETL process, or use it directly from the source and do the ETL there.
Our current situation: we have a transactional system running on a PostgreSQL database. We have access to this database through a linked server on the SQL Server instance that hosts our data warehouse.
We would like to automate the process of loading data from this linked server into the data warehouse. Below are the options/challenges on which we would like your thoughts.
Either we stage the data from the linked server in our own SQL Server first and have our ETL run from there daily, at night,
OR
we use the source system through the linked server directly, writing multiple joins to prepare the data on the fly and loading it into some sort of pre-staging table? (The issue we are currently facing with this option is that when we write multiple joins directly against the linked server, we get only 1 row back regardless of how many rows the query should return; but if we first copy each table used in the joins into its own temporary table and run the query against those, we get the expected results, roughly as sketched below.) Has anyone experienced this issue before, and what was the solution?
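For reference, the workaround we are using today looks roughly like the sketch below. The linked server name (PG_LINK) and the table/column names are only placeholders for this question, not our real schema:

```sql
-- Workaround sketch: instead of joining the remote tables directly through the
-- linked server (which gives us only 1 row), we copy each remote table into its
-- own local temporary table and run the join locally. All names are placeholders.
SELECT *
INTO   #orders
FROM   OPENQUERY(PG_LINK, 'SELECT order_id, customer_id, order_date FROM public.orders');

SELECT *
INTO   #customers
FROM   OPENQUERY(PG_LINK, 'SELECT customer_id, customer_name FROM public.customers');

-- The multi-table join now runs entirely on the local SQL Server and returns all rows.
SELECT o.order_id,
       o.order_date,
       c.customer_name
INTO   #pre_staging
FROM   #orders    AS o
JOIN   #customers AS c
       ON c.customer_id = o.customer_id;
```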
The traditional pattern for building a data warehouse is to follow your first option.
There are several reasons to stick to the pattern. For instance, when you do manipulation at the same time as extraction, you're typically putting load on the source system, which is one of the things you were trying to get away from when you built the data warehouse to begin with. Also, virtually every ETL tool I've encountered handles data manipulation in memory on the ETL server, which is almost always less efficient than handling the manipulations with queries or stored procedures on the SQL Server machine. Sometimes devastatingly slower. Certainly there are more reasons, but those are the ones that have bitten me when I've varied from the standard practice in the past.
If I were you, I'd test the performance of building out the replica tables by querying your linked server tables versus pulling them directly from server to server with SSIS. For a flat data move like that, SSIS can be very efficient, whereas for bulk data moves linked servers can be less so.
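For the linked-server side of that comparison, a flat pull might look something like the sketch below (the server, schema, and column names are placeholders); the SSIS side would just be an equivalent source-to-destination data flow that you time against it.

```sql
-- Flat, no-transformation pull of one source table into a local replica via the
-- linked server, to time against an equivalent SSIS data flow. Names are placeholders.
TRUNCATE TABLE staging.orders_replica;

INSERT INTO staging.orders_replica WITH (TABLOCK)  -- TABLOCK can enable minimal logging for the bulk insert
       (order_id, customer_id, order_date, amount)
SELECT order_id, customer_id, order_date, amount
FROM   OPENQUERY(PG_LINK,
       'SELECT order_id, customer_id, order_date, amount FROM public.orders');
```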
That said, you don't need to give up on incremental loads just because your source system doesn't have any change data tracking available.
You can flush and fill your replica tables every day, but when you do, you have the opportunity to introduce your own change tracking. The most common pattern for that is to introduce row hashing on import together with tables that keep track of which records you've already loaded. Getting into the details of all that is well outside the scope of this question, but there is more information about it in these two questions. Enough to fuel your own Googling, anyway.
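To make the row-hashing idea concrete, here is a minimal sketch, assuming a flush-and-fill replica table (staging.orders_replica) and a tracking table (dwh.orders_loaded) that stores each business key with the hash of the version last loaded; all of these names are placeholders to adapt to your own schema.

```sql
-- Minimal row-hash change-detection sketch. staging.orders_replica holds today's
-- full extract; dwh.orders_loaded remembers the hash of each row as last loaded.
-- All object and column names are placeholders.
;WITH hashed AS
(
    SELECT order_id,
           customer_id,
           order_date,
           amount,
           HASHBYTES('SHA2_256',
                     CONCAT(customer_id, '|', order_date, '|', amount)) AS row_hash
    FROM   staging.orders_replica
)
MERGE dwh.orders_loaded AS tgt
USING hashed            AS src
   ON tgt.order_id = src.order_id
WHEN NOT MATCHED BY TARGET THEN                      -- new row: never loaded before
    INSERT (order_id, customer_id, order_date, amount, row_hash)
    VALUES (src.order_id, src.customer_id, src.order_date, src.amount, src.row_hash)
WHEN MATCHED AND tgt.row_hash <> src.row_hash THEN   -- existing row whose content changed
    UPDATE SET tgt.customer_id = src.customer_id,
               tgt.order_date  = src.order_date,
               tgt.amount      = src.amount,
               tgt.row_hash    = src.row_hash;
```

The new and changed keys identified this way are then what drive the incremental load into your warehouse tables, instead of reprocessing the whole extract.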