Tags: c#, linq, optimization, datatable, duplicate-detection

C# - Looking for the list of duplicated rows (needs optimization)


I would like to optimize this C# code, if possible.

With fewer than 1,000 rows it is fine, but at 10,000 rows or more it starts to take quite some time. Here is a small benchmark:

  • 5,000 rows => ~2 s
  • 15,000 rows => ~20 s
  • 25,000 rows => ~50 s

To be clear, I'm looking for duplicated rows.

The SequenceEqual call used to compare the key values may be the problem (in my benchmark, I have 4 fields used as key fields).

Here is the code:

private List<DataRow> GetDuplicateKeys(DataTable table, List<string> keyFields)
{
    Dictionary<List<object>, int> keys = new Dictionary<List<object>, int>(); // List of key values + their index in table
    List<List<object>> duplicatedKeys = new List<List<object>>(); // List of duplicated keys values 

    List<DataRow> duplicatedRows = new List<DataRow>(); // Rows that are duplicated

    foreach (DataRow row in table.Rows)
    {
        // Find keys fields values for the row
        List<object> rowKeys = new List<object>();
        keyFields.ForEach(keyField => rowKeys.Add(row[keyField]));

        // Check if those keys are already defined
        bool alreadyDefined = false;

        foreach (List<object> keyValue in keys.Keys)
        {
            if (rowKeys.SequenceEqual(keyValue))
            {
                alreadyDefined = true;
                break;
            }
        }

        if (alreadyDefined)
        {
            duplicatedRows.Add(row);

            // If this is the first duplicate for this key, also add the first occurrence of this key
            if (!duplicatedKeys.Contains(rowKeys))
            {
                duplicatedKeys.Add(rowKeys);

                int i = keys[keys.Keys.First(key => key.SequenceEqual(rowKeys))];
                duplicatedRows.Add(table.Rows[i]);
            }
        }
        else
        {
            keys.Add(rowKeys, table.Rows.IndexOf(row));
        }
    }

    return duplicatedRows;
}
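
For what it's worth, I suspect the inner loop over keys.Keys is the real cost: every new row is compared with SequenceEqual against every key seen so far, so the number of comparisons grows roughly with the square of the row count, which would match the benchmark above. The Dictionary itself cannot short-circuit that, because List<object> has no value-based equality. A small standalone illustration of that behaviour (the values are made up):

// Illustration only, not part of the method above: a Dictionary keyed on List<object>
// compares keys by reference, so an identical-looking key is never found.
var lookup = new Dictionary<List<object>, int>();
lookup.Add(new List<object> { 1, "a" }, 0);

bool found = lookup.ContainsKey(new List<object> { 1, "a" });
Console.WriteLine(found); // False: same values, but a different list instance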

Any ideas?


Solution

  • I think this is the fastest and shortest way to find duplicate rows:

    For 100,000 rows it executes in about 250 ms.

    Main and test data:

    static void Main(string[] args)
    {
        var dt = new DataTable();
        dt.Columns.Add("Id");
        dt.Columns.Add("Value1");
        dt.Columns.Add("Value2");
    
        var rnd = new Random(DateTime.Now.Millisecond);
        for (int i = 0; i < 100000; i++)
        {
            var dr = dt.NewRow();
            dr[0] = rnd.Next(1, 1000);
            dr[1] = rnd.Next(1, 1000);
            dr[2] = rnd.Next(1, 1000);
            dt.Rows.Add(dr);
        }
    
        Stopwatch sw = new Stopwatch();
        sw.Start();
        var duplicates = GetDuplicateRows(dt, "Id", "Value1", "Value2");
        sw.Stop();
        Console.WriteLine(
            "Found {0} duplicates in {1} milliseconds.",
            duplicates.Count,
            sw.ElapsedMilliseconds);        
        Console.ReadKey();
    }
    

    GetDuplicateRows with LINQ:

    private static List<DataRow> GetDuplicateRows(DataTable table, params string[] keys)
    {
        var duplicates =
            table
            .AsEnumerable()
            // Build a composite key for each row by joining its key column values
            // into a string like "1-1-2", and group the rows on that key.
            .GroupBy(
                dr => String.Join("-", keys.Select(k => dr[k])),
                (groupKey, groupRows) => new { Key = groupKey, Rows = groupRows })
            // Keep only the groups that contain more than one row (the duplicates) ...
            .Where(g => g.Rows.Count() > 1)
            // ... and flatten those groups back into a single list of rows.
            .SelectMany(g => g.Rows)
            .ToList();

        return duplicates;
    }
    

    Explanation (for those who are new to LINQ):

    The most tricky part is the GroupBy I guess. Here I take as the first parameter a DataRow and for each row I create a group key from the values for the specified keys that I join to create a string like 1-1-2. Then the second parameter just selects the group key and the group rows into a new anonymous object. Then I check if there is more then 1 row and flatten the groups back into a list with SelectMany.