Tags: c#, linq, optimization, datatable, duplicate-detection

C# - Looking for the list of duplicated rows (needs optimization)


I would like to optimize this C# code, if possible.

With fewer than 1,000 rows it is fine, but at 10,000 rows or more it starts to take quite some time. Here is a small benchmark:

  • 5,000 rows => ~2 s
  • 15,000 rows => ~20 s
  • 25,000 rows => ~50 s

To be clear, I'm looking for duplicated rows.

The SequenceEqual call used to compare the key values may be the problem (in my benchmark, I have 4 fields used as key fields).

Here is the code:

private List<DataRow> GetDuplicateKeys(DataTable table, List<string> keyFields)
{
    Dictionary<List<object>, int> keys = new Dictionary<List<object>, int>(); // List of key values + their index in table
    List<List<object>> duplicatedKeys = new List<List<object>>(); // List of duplicated keys values 

    List<DataRow> duplicatedRows = new List<DataRow>(); // Rows that are duplicated

    foreach (DataRow row in table.Rows)
    {
        // Find keys fields values for the row
        List<object> rowKeys = new List<object>();
        keyFields.ForEach(keyField => rowKeys.Add(row[keyField]));

        // Check if those keys are already defined
        bool alreadyDefined = false;

        foreach (List<object> keyValue in keys.Keys)
        {
            if (rowKeys.SequenceEqual(keyValue))
            {
                alreadyDefined = true;
                break;
            }
        }

        if (alreadyDefined)
        {
            duplicatedRows.Add(row);

            // If this is the first duplicate for this key, also add the first occurrence of this key
            if (!duplicatedKeys.Contains(rowKeys))
            {
                duplicatedKeys.Add(rowKeys);

                int i = keys[keys.Keys.First(key => key.SequenceEqual(rowKeys))];
                duplicatedRows.Add(table.Rows[i]);
            }
        }
        else
        {
            keys.Add(rowKeys, table.Rows.IndexOf(row));
        }
    }

    return duplicatedRows;
}
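
For what it's worth, I suspect the inner loop over keys.Keys is the real cost: every new row is compared with SequenceEqual against every key seen so far, so the number of comparisons grows roughly with the square of the row count, which would match the benchmark above. The Dictionary itself cannot short-circuit that, because List<object> has no value-based equality. A small standalone illustration of that behaviour (the values are made up):

// Illustration only, not part of the method above: a Dictionary keyed on List<object>
// compares keys by reference, so an identical-looking key is never found.
var lookup = new Dictionary<List<object>, int>();
lookup.Add(new List<object> { 1, "a" }, 0);

bool found = lookup.ContainsKey(new List<object> { 1, "a" });
Console.WriteLine(found); // False: same values, but a different list instance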

Any ideas?


Solution

  • I think this is the fastest and shortest way to find duplicate rows:

    For 100,000 rows it executes in about 250 ms.

    Main and test data:

    static void Main(string[] args)
    {
        var dt = new DataTable();
        dt.Columns.Add("Id");
        dt.Columns.Add("Value1");
        dt.Columns.Add("Value2");
    
        var rnd = new Random(DateTime.Now.Millisecond);
        for (int i = 0; i < 100000; i++)
        {
            var dr = dt.NewRow();
            dr[0] = rnd.Next(1, 1000);
            dr[1] = rnd.Next(1, 1000);
            dr[2] = rnd.Next(1, 1000);
            dt.Rows.Add(dr);
        }
    
        Stopwatch sw = new Stopwatch();
        sw.Start();
        var duplicates = GetDuplicateRows(dt, "Id", "Value1", "Value2");
        sw.Stop();
        Console.WriteLine(
            "Found {0} duplicates in {1} milliseconds.",
            duplicates.Count,
            sw.ElapsedMilliseconds);        
        Console.ReadKey();
    }
    

    GetDuplicateRows with LINQ:

    private static List<DataRow> GetDuplicateRows(DataTable table, params string[] keys)
    {
        var duplicates =
            table
            .AsEnumerable()
            // Build a composite key for each row by joining its key column values
            // into a string like "1-1-2", and group the rows on that key.
            .GroupBy(
                dr => String.Join("-", keys.Select(k => dr[k])),
                (groupKey, groupRows) => new { Key = groupKey, Rows = groupRows })
            // Keep only the groups that contain more than one row (the duplicates) ...
            .Where(g => g.Rows.Count() > 1)
            // ... and flatten those groups back into a single list of rows.
            .SelectMany(g => g.Rows)
            .ToList();

        return duplicates;
    }
    

    Explanation (for those who are new to LINQ):

    The most tricky part is the GroupBy I guess. Here I take as the first parameter a DataRow and for each row I create a group key from the values for the specified keys that I join to create a string like 1-1-2. Then the second parameter just selects the group key and the group rows into a new anonymous object. Then I check if there is more then 1 row and flatten the groups back into a list with SelectMany.