Search code examples
sql-servervb.netlinqbulkinsertbulkupdate

is there something faster than Enumerable.Except<TSource> Method?


I have a program that downloads data from server database to client database. server database keeps growing recently.

in that program, there is an option to select download all data OR download data for a specific time period (can select backward days from today). if the user selects all, I wrote the program to truncate client database table and insert all data using bulk copy. that part is ok.

but the problem is when user select a specific time period (each recode has created data time ) program has to compare two tables and divide recodes (server data) in two tables. one is, not exist data and the second one is not existing data. and what I'm going to do is, not existing data directly insert into client DB (i'm using bulk insert) and Existing data inserting into a tempory table using bulkcopy and after update client's table using the above tempory table. My actual problem occurs when dividing server's table. this is how I did it

 updateTable = (From c In dt_from_server.AsEnumerable() 
                Join o In Dt_from_client.AsEnumerable() 
                On c.Field(Of String)("BARCODE").Trim() Equals o.Field(Of String)("BARCODE").Trim() 
                And c.Field(Of String)("ITEM_CODE").Trim() Equals o.Field(Of String)("ITEM_CODE").Trim() 
                Select c).CopyToDataTable()

 insertTable = dt_server.AsEnumerable()
    .Except(updateTable.AsEnumerable(), DataRowComparer.Default)
    .CopyToDataTable()

(normally there is over 1M recodes in the server table )

when there is over 1 Milion recodes, Update part taking acceptable time like 10 minutes (Yes it taking 5GB space from Ram - in this case, it's ok when considering performance ) but insert part seams taking days, just to assing the insertTable(datatable). this is the issue.
AsEnumerable().Except() part taking long time and I couldn't find a solution speedup this process. I'm not sure I explained this correctly. Could anyone can give me some advice for this?


Solution

  • Since you have commented that dt_from_server and dt_server are actually the same DataTable you don't need to compare all values of all DataRows with each other, which is what DataRowComparer.Default does. You can use Except without second parameter for the comparer, then only references are compared which is much faster.

    You also don't need two CopyToDataTable which creates two additonal big DataTables in memory, process the rows one after the other.

    Here is a different approach using Linq's left-outer join, which is more efficient:

    Dim query = from rServ in dt_from_server.AsEnumerable()
                group join rClient in Dt_from_client.AsEnumerable()
                On New With{
                    Key .BarCode = rServ.Field(Of String)("BARCODE").Trim(),
                    Key .ItemCode = rServ.Field(Of String)("ITEM_CODE").Trim()
                } Equals New With{
                    Key .BarCode = rClient.Field(Of String)("BARCODE").Trim(),
                    Key .ItemCode = rClient.Field(Of String)("ITEM_CODE").Trim()
                }   into Group
                From client In Group.DefaultIfEmpty()  
                Select new With { .ServerRow = rServ, .InsertRow = client is Nothing }
    
    Dim insertOrUpdateRows = query.ToLookup(Function(x) x.InsertRow, Function(x) x.ServerRow)
    Dim insertRows = insertOrUpdateRows(true).CopyToDataTable()  'CopyToDataTable redundant if you process rows immediately now' 
    Dim updateRows = insertOrUpdateRows(false).CopyToDataTable() 'CopyToDataTable redundant if you process rows immediately now' 
    

    But in general the most scalable and efficient approach would be to not load all into memory at once and then process all, but to use database paging(or a stored-procedure) to process only parts of it in memory, otherwise it's likely that you will encounter a OutOfMemoryException sooner or later.


    C# as requested:

    var query = from rServ in dt_from_server.AsEnumerable()
                join rClient in Dt_from_client.AsEnumerable()
                on new { BarCode = rServ.Field<string>("BARCODE").Trim(), ItemCode = rServ.Field<string>("ITEM_CODE").Trim() }
                equals new { BarCode = rClient.Field<string>("BARCODE").Trim(), ItemCode = rClient.Field<string>("ITEM_CODE").Trim() } 
                into clientGroup
                from client in clientGroup.DefaultIfEmpty()
                select new { ServerRow = rServ, InsertRow = client == null };
    
    var insertOrUpdateRows = query.ToLookup(x => x.InsertRow, x => x.ServerRow);
    var insertRows = insertOrUpdateRows[true].CopyToDataTable();  // CopyToDataTable redundant if you process rows immediately now
    var updateRows = insertOrUpdateRows[false].CopyToDataTable(); // CopyToDataTable redundant if you process rows immediately now