I am trying to learn about SSIS and have few doubts regarding to that.
I have want to compare 2 tables.1 table is in Sql Server
and another is in Oracle
.
Both table will have same schema like below :
Sql Server:
Id Amount
1 100
2 200
3 300
Oracle:
Id Amount
3 3000
2 2000
1 1000
This is just an few sample records as i have 24 millions records in source(12 millions) and target(12 millions) in some random order.
Task : I am trying to compare source and target data.As there will always be 1 to 1 match between source and target data based on joining id column from source and target
and doing comparision on Amount column
and storing unmatch records in sql server database so i know Look up transformation
will do in this case.
But however i have some doubts :
1) If i fire select * from query for source and target then where does 24 millions records will stay? In memory?
2) Can i get memory exception in this case?
3) As both the result set(i.e) are in different order in source and target how does look up will work? Will it load all source data and then it will match 1 by 1 record in target data by not loading whole target data?
4) How does SSIS handles millions of data comparision for source and target?
Can anybody please help me clearing above doubts?
If you do this with a Lookup, neither row set will be stored completely in memory, unless you use full cache. If you use cache, then the Target data will be stored in memory, and sure, you could get memory exceptions if you don't have enough available memory.
A lookup is a terrible idea because for every row in the source data you are going to query the target data. So you are going to issue 12 million individual queries against the target before you are done. This is the worst performing option.
A Merge Join is faster because your data is pre-sorted on the matching key so the matching is much faster. Also neither dataset needs to be held in memory. The rows flow freely without waiting for the entire dataset to be loaded.
Here is a comparison between Lookup and Merge Join.
The fastest option would be to load your target data directly to a staging table on the same server as your source data, and index that table on the joining key. Then you can do the comparison in SQL, joining on indexed columns, which would give you the fastest performance.