Search code examples
entity-framework.net-coreentity-framework-coreone-to-many

How does one iterate over all rows and apply updates in EF Core effectively?


I need to iterate over an entire Artist table that has a 1-to-many relation ship to Songs. The details for the artists and list of songs are stored in a file and this path is saved on the Artist table, DataFilePath.

The code is below is my attempt at this, but it gives the following exception: System.InvalidOperationException: Connection is busy

// Artist to Songs is a 1 to many relationship
foreach(var artist in db.Artist.Include(x => x.Songs)) 
{
    try
    {
        // Load details for artist and songs from artist.DataFilePath
        var fileData = LoadArtistFile(artist.DataFilePath);

        // UpdateArtists() will
        //  - update each property for the Artists entity
        //  - add any new Songs 
        //  - update each Song property for existing songs
        UpdateArtist(artist, fileData);
        
        db.SaveChanges();
    } catch (Exception err)
    {
        _logger.LogError(err.Message);
    }
}

2 questions:

  1. How do I fix the code above to avoid the Connection is busy problem? (I've tried a number of other variations including lazy loading and explicit loading).

  2. If the artists table were millions of rows and therefore songs table were 10's of millions of rows - is this still the most effective way to update the both tables? (staying within EF Core)

UPDATE

I've taken David's answer and modified the approach to use batching because I'm not able to load all rows into memory at once...

int batchSize = 50; // batch of 50 gave best performance (vs 10 and 100)
int currentBatch = 0;
// get the max id from the database for a stopping point (probably a better way to do this)
int maxId = GetMaxArtistID();
bool done = false;
while (!done)
{
    using (var db = new Context())
        {
            var artists = db.Artists
                .OrderBy(x => x.Id)
                .Skip(currentBatch++ * batchSize)
                .Take(batchSize)
                .Include(x => x.Songs)
                .ToList();
            var ids = String.Join(", ", artists.Select(x => x.Id));
            Console.WriteLine($"Working on batch:{currentBatch} ids: {ids}");
            foreach (var artist in artists)
            {
                var fileData = LoadArtistFile(artist.DataFilePath)
                UpdateArtist(artist, fileData);
                if (artist.Id >= maxId)
                    done = true;
            }
            db.SaveChanges();
        }
    }
}

Solution

  • How do I fix the code above to avoid the Connection is busy problem?

    Load the data into memory before iterating it. eg

    foreach(var artist in db.Artist.Include(x => x.Songs).ToList()) 
    

    If the artists table were millions of rows and therefore songs table were 10's of millions of rows - is this still the most effective way to update the both tables? (staying within EF Core)

    EF itself doesn't do server-side data modification, so that's really the only way. But at some scale you want to not use EF, and load your data file into a table and perform the update on the server side.