Search code examples
entity-frameworkdatabase-migrationexcept

Bad performance with except too large in Entity Framework


I have to do a migration from one database(O) to another of certain data that is not in the final database(D), one of the problem is that the databasee's tables has differents names in fields. I was trying to do getting the data that is not in D database with:

var dtOrigin = from o in genEntitCeres.X.AsNoTracking() select o;
var dtDestiny = from d in genEntitAgp.X.AsNoTracking() select d;

var rowsMatch = from tOrigin in dtOrigin.AsEnumerable()
                    join tDestiny in dtDestiny.AsEnumerable()
                        on new { tOrigin.a, tOrigin.b} equals
                        new { tDestiny.a, tDestiny.b}
                    select tOrigin;

var rowsNotMatch = (from tOrigin in dtOrigin.AsEnumerable()
                    where !rowsMatch.Contains(tOrigin)
                    select tOrigin);

When I tried to do a for each or rowsNotMatch.Count(), it takes too long time...

My solution was doing with SqlQuery from a query getting the dismatch rows (more less 5 seconds, depends on siz of data mismatch). But I would like to know if there is another way with EF, like my code but that is not getting freeze.


Solution

  • This looks like you are opening 2 tables in 2 different dbContexts, want to compare 2 tables based on 2 columns in each table, find the rows in original table that don't have a match. Using EF to do something like this would not be my recommendation. It would be far better to do this at the database /w temp tables. By doing the .AsEnumerable() of each table you are loading the entire contents of those tables into memory.

    If you must do something like this in code: I'd consider something like:

    var destinationMatches = genEntitAgp.X.AsNoTracking()
      .Select(x=> new { x.id, x.a, x.b })
      .ToList();
    

    If this destination table is expected to be large, then you should perform this in pages of 1000 or so using .Skip() and .Take().

    The next step would be to insert all rows into a buffer table in the origin DB. The buffer table holds the PK of the destination row, and the criteria to match on.

    public class TempX
    {
      public int Id { get; set; }
      public string a { get; set; }
      public string b { get; set; }
    }
    

    I'd recommend a separate DBContext for this rather than the Origin one, to initialize without change tracking, and clear out any records initially in the TempX table. If you're inserting > 1000 rows you want to do this in batches of 1000, saveChanges, then dispose and re-create the context with each batch to keep the operations fast. The origin dbContext will need to know about TempX as well, just not for the population.

    Once the records are in the tempX table on the source schema:

    var entities = genEntitCeres.X.AsNoTracking()
      .Where(x => !genEntitCeres.TempX.AsNoTracking().Any(tx=>tx.a == x.a && tx.b == x.b)).ToList();
    

    Again if there are expected to be a lot of non-matches, then rather than the .ToList() use a .Skip() and .Take() approach.

    If this is something that could possibly be run more than once on a given database then I'd consider also hopefully having something like a CreatedAt/ModifiedAt dateTime value that I could use to filter on for both the Origin and Destination databases. I.e. record last run DateTime, then filter rows from both queries based on > LastRunDate to reduce the # of rows pulled and compared.