Search code examples
c#datatableduplicatesfiltering

Remove duplicates from a datatable


I have a datatable which has about 20 columns and 10K rows. I need to remove the duplicate rows in this datatable based on 4 key columns. Doesn't .Net have a function which does this? The function closest to what I am looking for was datatable.DefaultView.ToTable(true, array of columns to display), But this function does a distinct on all the columns.

EDIT: This datatable is being created by reading a CSV file, not from a database, so using an SQL query is not an option.


Solution

  • You can use Linq to Datasets. Check this. Something like this:

    // Fill the DataSet.
    DataSet ds = new DataSet();
    ds.Locale = CultureInfo.InvariantCulture;
    FillDataSet(ds);
    
    List<DataRow> rows = new List<DataRow>();
    
    DataTable contact = ds.Tables["Contact"];
    
    // Get 100 rows from the Contact table.
    IEnumerable<DataRow> query = (from c in contact.AsEnumerable()
                                  select c).Take(100);
    
    DataTable contactsTableWith100Rows = query.CopyToDataTable();
    
    // Add 100 rows to the list.
    foreach (DataRow row in contactsTableWith100Rows.Rows)
        rows.Add(row);
    
    // Create duplicate rows by adding the same 100 rows to the list.
    foreach (DataRow row in contactsTableWith100Rows.Rows)
        rows.Add(row);
    
    DataTable table =
        System.Data.DataTableExtensions.CopyToDataTable<DataRow>(rows);
    
    // Find the unique contacts in the table.
    IEnumerable<DataRow> uniqueContacts =
        table.AsEnumerable().Distinct(DataRowComparer.Default);
    
    Console.WriteLine("Unique contacts:");
    foreach (DataRow uniqueContact in uniqueContacts)
    {
        Console.WriteLine(uniqueContact.Field<Int32>("ContactID"));
    }