Tags: c#, garbage-collection

How to avoid garbage collection problems when dealing with large in-memory lists in C#


We had some SQL performance problems with our article-tag relationships, so we decided to keep our article/tag data in memory. That gave us a significant boost, but it is now causing headaches with garbage collection whenever an entire list is removed and replaced with a new one (3M+ records).

Here is a piece of code:

private readonly IContextCreator _contextCreator;
private static volatile List<TagEngineCacheResponse> _cachedList = new List<TagEngineCacheResponse>();
private readonly int KEYWORD_GROUP_NAME = 1;
private static BitmapIndex.BitmapIndex _bitMapIndex = new BitmapIndex.BitmapIndex();

public TagEngineService(IContextCreator contextCreator)
{
    _contextCreator = contextCreator;
}

public async Task RepopulateEntireCacheAsync()
{
    using (var ctx = _contextCreator.PortalContext())
    {
        var cmd = ctx.Database.Connection.CreateCommand();
        cmd.CommandText = BASE_SQL_QUERY;

        await ctx.Database.Connection.OpenAsync();

        using (var reader = await cmd.ExecuteReaderAsync())
        {
            var articles = ((IObjectContextAdapter)ctx)
                .ObjectContext
                .Translate<TagEngineCacheResponse>(reader).ToList();

            // recreate bitmap indexes
            BitmapIndex.BitmapIndex tempBitmapIndex = new BitmapIndex.BitmapIndex();
            int recordRow = 0;
            foreach (var record in articles)
            {
                tempBitmapIndex.Set(new BIKey(KEYWORD_GROUP_NAME, record.KeywordId), recordRow);
                recordRow++;
            }

            // swap in the new list and index; the old ones become garbage
            _cachedList = articles;
            _bitMapIndex = tempBitmapIndex;
        }
    }
}

Class definition:

public class TagEngineCacheResponse
{
    public int ArticleId { get; set; }
    public int KeywordId { get; set; }
    public DateTime PublishDate { get; set; }
    public int ViewCountSum { get; set; }
}

As you can see, when the cache is recreated, _cachedList is replaced with a new list and the old one becomes eligible for garbage collection. At that point, CPU time spent in GC jumps to 60-90% for 2-3 seconds.

Are there any ideas on how to improve this piece of code to avoid these GC problems?


Solution

  • I would guess the list takes about 44 bytes per object, or ~130 MB for 3M objects. This is a bit on the large side, but not incredibly so.

    Some suggestions:

    The list's backing array is well over the 85,000-byte limit for the small object heap (SOH), so it will be allocated on the large object heap (LOH). The LOH is only collected during gen 2 collections, and gen 2 collections can be expensive. To avoid this, it is recommended to avoid de-allocating gen 2 objects as much as possible, i.e. allocate them once and then reuse them for as long as you can.
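    For instance, you can confirm that the rebuild is what triggers the gen 2 work using the built-in GC counters. This is a minimal diagnostic sketch wrapped around the existing call (tagEngineService stands in for however you reach the service):

        int gen2Before = GC.CollectionCount(2);

        await tagEngineService.RepopulateEntireCacheAsync();

        // Each gen 2 collection here also sweeps the LOH, which is where
        // the discarded 3M-slot backing array ends up.
        int gen2After = GC.CollectionCount(2);
        Console.WriteLine($"Gen 2 collections during rebuild: {gen2After - gen2Before}");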

    You could instead fetch the list from the database in smaller chunks and update the list in place, making sure each chunk stays within the SOH limit. You might consider either locking the list to ensure it is not accessed while updating, or keeping two alternating lists, where you update one and then switch the 'active' list, as in the sketch below.
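    A minimal sketch of the double-buffer variant, assuming a hypothetical FetchChunkAsync helper that pages through the query (e.g. via OFFSET/FETCH); the bitmap index rebuild is omitted for brevity. Both buffers are allocated once, and List<T>.Clear() keeps the backing array, so the large LOH allocation is reused instead of being collected and re-created on every refresh:

        private static readonly List<TagEngineCacheResponse>[] _buffers =
        {
            new List<TagEngineCacheResponse>(4000000), // room for ~4M records, allocated once
            new List<TagEngineCacheResponse>(4000000),
        };
        private static volatile int _activeIndex;

        // Readers always go through this property.
        private static List<TagEngineCacheResponse> ActiveList => _buffers[_activeIndex];

        public async Task RepopulateInPlaceAsync()
        {
            int inactive = 1 - _activeIndex;
            var target = _buffers[inactive];
            target.Clear(); // resets Count but keeps the backing array on the LOH

            // 10,000 references per chunk is ~80 KB, safely under the 85,000-byte threshold.
            const int chunkSize = 10000;
            int offset = 0;
            List<TagEngineCacheResponse> chunk;
            while ((chunk = await FetchChunkAsync(offset, chunkSize)).Count > 0)
            {
                target.AddRange(chunk);
                offset += chunk.Count;
            }

            _activeIndex = inactive; // publish the freshly filled list to readers
        }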

    You are using a class for TagEngineCacheResponse, which causes a great number of individual objects to be allocated. While these are small enough to fit on the SOH, they may, if you are unlucky, survive long enough to be promoted to the gen 2 heap. While GC time is not greatly affected by un-referenced objects, it might still be better to use a value type and avoid the problem entirely, as sketched below. Profile to make sure it actually helps.
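    A sketch of the value-type variant, using a hypothetical TagEngineCacheEntry struct. The values are stored inline in the list's backing array, so the cache becomes one large allocation of plain data rather than 3M+ separate heap objects the GC has to trace. Note that ObjectContext.Translate materializes into classes, so with a struct you would most likely populate the list manually from the DbDataReader:

        public struct TagEngineCacheEntry
        {
            public readonly int ArticleId;
            public readonly int KeywordId;
            public readonly DateTime PublishDate;
            public readonly int ViewCountSum;

            public TagEngineCacheEntry(int articleId, int keywordId, DateTime publishDate, int viewCountSum)
            {
                ArticleId = articleId;
                KeywordId = keywordId;
                PublishDate = publishDate;
                ViewCountSum = viewCountSum;
            }
        }

        // One backing array holds all the data inline; there is nothing for
        // the GC to trace per entry.
        private static volatile List<TagEngineCacheEntry> _cachedEntries =
            new List<TagEngineCacheEntry>();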