Search code examples
c#performancelistmorelinq

Is there a better way to remove duplicate values on multiple properties in a List?


Trying to remove duplicates from a list where a duplicate is when either the first, the second or both properties are equal (appear more than once in the list). Using MoreLINQ, the code below is working:

var list = new List<LinqTest> // LinqTest: object containing 2 strings
{
    // are ok
    new LinqTest { Str1 = "a1", Str2 = "b1"},
    new LinqTest { Str1 = "a2", Str2 = "b2"},
    new LinqTest { Str1 = "a3", Str2 = "b3"},
    new LinqTest { Str1 = "a5", Str2 = "b5"},
    new LinqTest { Str1 = "a6", Str2 = "b6"},
    new LinqTest { Str1 = "x1", Str2 = "y1"},
    new LinqTest { Str1 = "y1", Str2 = "x1"},

    // must be removed
    new LinqTest { Str1 = "d1", Str2 = "b4"},
    new LinqTest { Str1 = "d1", Str2 = "d2"},
    new LinqTest { Str1 = "d1", Str2 = "d2"},
    new LinqTest { Str1 = "a4", Str2 = "d2"},
    new LinqTest { Str1 = "d3", Str2 = "b7"},
    new LinqTest { Str1 = "d3", Str2 = "b8"},
    new LinqTest { Str1 = "d3", Str2 = "b8"},
};

var duplicatesStr1 = list
    .GroupBy(x => x.Str1)
    .Where(x => x.Count() > 1)
    .SelectMany(x => x)
    .ToList();

var duplicatesStr2 = list
    .GroupBy(x => x.Str2)
    .Where(x => x.Count() > 1)
    .SelectMany(x => x)
    .ToList(); ;

var res = list
    .ExceptBy(duplicatesStr1, x => x.Str1)
    .ExceptBy(duplicatesStr2, x => x.Str2);

var rem = duplicatesStr1
    .Union(duplicatesStr2)
    .DistinctBy(x => new { x.Str1, x.Str2})
    .ToList();

Console.WriteLine("----------");
foreach (var linqTest in res)
{
    Console.WriteLine("keep> " + linqTest.Str1 + "-" + linqTest.Str2);
}

Console.WriteLine("----------");
foreach (var linqTest in rem)
{
    Console.WriteLine("remove> " + linqTest.Str1 + "-" + linqTest.Str2);
}

Question: Is there a more efficient and/or shorter way to achieve this?


Solution

  • You could do two passes by first getting all the values that exist for the first and second properties more than once, and then filter on that. You'd need 2 hash sets for each property. The first one is to keep track if a value has been seen at least once. And if it has been seen at least once it's then added to the second hash set. Thus resulting in one hash set for each property that only contains the values that are duplicated. Then just filter out any items that are in either of those hash sets.

    HashSet<string> hash1Once = new HashSet<string>();
    HashSet<string> hash1More = new HashSet<string>();
    HashSet<string> hash2Once = new HashSet<string>();
    HashSet<string> hash2More = new HashSet<string>();
    
    foreach(var item in list){
        if(!hash1Once.Add(item.Str1))
            hash1More.Add(item.Str1);
        if(!hash2Once.Add(item.Str2))
            hash2More.Add(item.Str2);
    }
    
    var unique = list.Where(x => !hash1More.Contains(x.Str1) && !hash2More.Contains(x.Str2))
        .ToList();