Search code examples
c#performancelinqlistlarge-data-volumes

How to efficiently filter objects out of an (initially) large list of objects


I need to filter a large list of complex (20+ properties) objects into multiple sub lists. To create the sub-lists, I have a list of filter specifications. Requirements are: a) An item is not allowed to be part of two sub lists and b) it must be possible to get hold of all undivided items after processing has finished.

Currently I use the following algorithm:

  1. List item
  2. Put the objects to be filtered in a Generic List
  3. For each filter specification:
    • Create a Where expression (Expression>)
    • Apply the expression using Linq > Where to the list of objects
    • Get the resulting IEnumerable of selected objects and store them in a list, together with the description of the filter
    • Remove the items found from the source list using Linq > Except to create a new list to continue working with and to prevent an object from being put in more than one sub list
  4. Check whether there a still (undivided) objects in the working list

My initial list of objects can be over 400.000 objects and I've noticed that both the filtering, as well as reducing the working list takes some time. So I would like to know:

  1. Filtering to create the sub-lists takes place on a maximum of 7 properties of my object. Is there a way to improve performance of a Linq > Where selection?
  2. Is there a way to prevent items from being selected into multiple sub-lists, without reducing the working collection by using Except or RemoveAll (possible improvement)?

Thanks in advance!


Solution

  • If you can not leverage any indexes in the incoming list you are trying to classify then you are better off just iterating through the whole list only once and classifying the items as you go. This way you avoid unnecessary remove and except operations that are seriously hurting the performance with pointless iterations and equality comparisons.

    I was thinking about something a long the lines of:

    public static IDictionary<string, List<T>> Classify<T>(this IEnumerable<T> items, IDictionary<string, Predicate<T>> predicates, out List<T> defaultBucket)
    {
        var classifiedItems = new Dictionary<string, List<T>>(predicates.Count);
        defaultBucket = new List<T>();
    
        foreach (var predicate in predicates)
        {
            classifiedItems.Add(predicate.Key, new List<T>()); 
        }
    
        foreach (var item in items)
        {
            var matched = false;
    
            foreach (var predicate in predicates)
            {
                if (predicate.Value(item))
                {
                    matched = true;
                    classifiedItems[predicate.Key].Add(item);
                    break;
                }
            }
    
            if (!matched)
            {
                defaultBucket.Add(item);
            }
        }
    
        return classifiedItems;
    }
    

    Any given predicate can be as complex as you need it to be. Only condition is that it takes in a T and returns a bool. If that is not enough, nothing is preventing you from implementing your own MyPredicate<???> with whatever signature you need.

    EDIT: Edited the code to handle a "default bucket" where items that don't comply with any of the specified predicates go.