Tags: c#, entity-framework, memory-leaks, garbage-collection, dispose

Why does process memory increase when fetching a lot of data from the DB via Entity Framework in buckets?


In the following code, in each iteration of the loop I do the following:

  1. Fetch 5000 entities from DB
  2. According to the logic in FilterRelationsToDelete I decide which entities to delete
  3. I add the ids of entities to delete to a collection
  4. After the loop finishes, I delete the entities from the DB according to the idsToDelete collection.

I saw in the Visual Studio Diagnostic Tools that the memory of the process rises at the beginning of each loop iteration and drops by about half after the iteration finishes. My problem is that the behavior is inconsistent: sometimes it rises to 800MB and drops to 400MB, sometimes it stays steady at 200MB, and sometimes it climbs over 1GB, drops to 500MB, and stays steady there.

I am not sure why the process memory is not steady at around 200MB with small spikes as the data arrives from the DB. What might be the reasons for that? Maybe Entity Framework does not free all the memory it used? Maybe the GC.Collect() I trigger here on purpose does not clean up all the memory I expected it to? Maybe I have a bug that I am not aware of?

The memory taken by the list of longs I accumulate in idsToDelete is close to zero, so that is not the problem. Is there any way to write this code better?

private static void PlayWithMemory()
{
    int bucketSize = 5000;
    List<long> idsToDelete = new List<long>();
    for (int i = 0; i < 500; i++)
    {
        System.GC.Collect();//added just for this example
        using (var context = new PayeeRelationsContext())
        {
            int toSkip = i * bucketSize;
            List<PayeeRelation> dbPayeeRelations = GetDBRelations(context, toSkip, bucketSize);
            var relationsToDelete = FilterRelationsToDelete(dbPayeeRelations);
            List<long> ids = relationsToDelete.Select(x => x.id).ToList();
            idsToDelete.AddRange(ids);
            Console.WriteLine($"i = {i}, toSkip = {toSkip}, dbPayeeRelations.Count = {dbPayeeRelations.Count}");
        }
    }
}

private static List<PayeeRelation> GetDBRelations(PayeeRelationsContext context, int toSkip,
    int bucketSize)
{
    return context.PayeeRelations
        .OrderBy(x => x.id)
        .Include(x => x.PayeeRelation_PayeeVersion)
        .Skip(toSkip)
        .Take(bucketSize)
        .AsNoTracking()
        .ToList();
}

Solution

  • I don't see anything inherently wrong with your code to indicate a memory leak. I believe what you are observing is simply that the garbage collection does not fully "release" memory as soon as the references are deemed unused or out of scope.
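    What Diagnostic Tools shows is the process working set; even after a collection the CLR tends to keep freed segments reserved for reuse, and by default the large object heap (which holds the backing arrays of big lists like your 5000-entity buckets) is swept but not compacted. If you want to confirm there is no leak, compare live managed bytes around one bucket instead. A minimal sketch, reusing your GetDBRelations (the LOH compaction opt-in is the one thing GC.Collect() alone doesn't do):

    long before = GC.GetTotalMemory(forceFullCollection: true);

    using (var context = new PayeeRelationsContext())
    {
        // Fetch one bucket exactly as the loop does.
        List<PayeeRelation> bucket = GetDBRelations(context, 0, 5000);
        Console.WriteLine($"Fetched {bucket.Count} rows");
    }

    // The large object heap is swept but not compacted by default;
    // this opt-in takes effect on the next full blocking collection.
    System.Runtime.GCSettings.LargeObjectHeapCompactionMode =
        System.Runtime.GCLargeObjectHeapCompactionMode.CompactOnce;
    long after = GC.GetTotalMemory(forceFullCollection: true);

    Console.WriteLine($"Live managed bytes: {before:N0} -> {after:N0}");

    If before and after come out roughly equal, the buckets are being reclaimed and the sawtooth you see is just allocation/collection cadence, not a leak.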

    If memory use/allocation is a concern, you should consider projecting down to the minimal viable data you need to validate in order to identify which IDs need to be deleted. For example, if you need the ID and Field1 from the PayeeRelations, plus Field2 and Field3 from the related PayeeVersion:

    private class RelationValidationDetails
    {
        public long PayeeRelationId { get; set; }
        public string Field1 { get; set; }
        public string Field2 { get; set; }
        public DateTime Field3 { get; set; }
    }
    

    ...then in your query:

    var validationData = context.PayeeRelations
        .OrderBy(x => x.id)
        .Select(x => new RelationValidationDetails 
        {
            PayeeRelationId = x.id,
            Field1 = x.Field1,
            Field2 = x.PayeeRelation_PayeeVersion.Field2,
            Field3 = x.PayeeRelation_PayeeVersion.Field3
        }).Skip(toSkip)
        .Take(bucketSize)
        .ToList();
    

    Then your validation just takes the above collection of validation details to determine which IDs need to be deleted (assuming it bases that decision on Fields 1-3). This ensures that your query returns exactly the data needed to ultimately get the IDs to delete, minimizing memory growth.
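    For illustration, the filtering step would then operate on the DTO rather than on entities; a minimal sketch, where the predicate is a made-up stand-in for whatever your FilterRelationsToDelete actually checks:

    private static List<long> FilterIdsToDelete(List<RelationValidationDetails> candidates)
    {
        // Hypothetical rule; substitute the real logic from FilterRelationsToDelete.
        return candidates
            .Where(x => x.Field1 == null || x.Field3 < DateTime.UtcNow.AddYears(-5))
            .Select(x => x.PayeeRelationId)
            .ToList();
    }

    The loop body then shrinks to idsToDelete.AddRange(FilterIdsToDelete(validationData)); and no PayeeRelation or PayeeVersion entities are materialized at all.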

    One could argue that if a "Field4" is later required for the validation, you would have to update this object definition and revise the query, which is extra work when you could just use the entity. However, Field4 might not come from PayeeRelations or the PayeeVersion; it might come from a different related entity that currently isn't eager loaded. That would mean paying the cost of eager loading another table for every caller of the wrapped GetPayeeRelations call, whether they need that data or not; or risking performance hits from lazy loading (removing the AsNoTracking()); or introducing conditional complexity to tell GetPayeeRelations which relationships to eager load. Trying to predict this possibility is really just an example of YAGNI.

    I generally don't recommend hiding EF queries behind getter methods (such as generic repositories), simply because these tend to form a lowest common denominator while chasing DRY or SRP. The reality is that they end up being single points that are inefficient in many cases, because if any one consumer needs a relationship eager loaded, all consumers get it eager loaded. It's generally far better to let each consumer project down to exactly what it needs, rather than worry that similar (rather than identical) queries might appear in multiple places.
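    For example, two callers with different needs can each project for themselves (the Name column here is invented for the sake of the sketch):

    // Consumer A: a report that only needs ids and a display name.
    var reportRows = context.PayeeRelations
        .OrderBy(x => x.id)
        .Select(x => new { x.id, x.Name })
        .ToList();

    // Consumer B: the validation query above, which pulls in the
    // PayeeVersion fields. Neither query pays for data only the other needs.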