Tags: c#, blueprism

Delete duplicates in a single dataset relative to another one in C#


I'm new to C#. I'm trying to remove duplicates from CollectionIn, but it doesn't work: no duplicates are removed from CollectionIn.

Just to clarify: CollectionIn has [A,B,C,D] and CollectionIn2 has [A,B,C].

So I want to remove the values A, B, and C from CollectionIn.

for (int i = 0; i < CollectionIn.Rows.Count; i++) {
    string value1 = CollectionIn.Rows[i].ItemArray[0].ToString().ToLower(); 

    for (int i2 = 0; i2 < CollectionIn2.Rows.Count; i2++) {
        string value2 = CollectionIn2.Rows[i2].ItemArray[0].ToString().ToLower(); 

        if (value1 == value2) {
            //Remove value1 when value1 == value2
            CollectionIn.Rows[i].Delete(); // <-- trying to delete when there is a duplicate in both collections

            CollectionIn.AcceptChanges();
        }
    }
    //CollectionOut.Rows.Add(value1);
}

I made some changes to the code from this link: http://www.rpaforum.net/threads/how-to-compare-two-excel-sheet-using-c-code-in-blueprism.897/


Solution

  • Comparing two sets pairwise has O(n²) complexity. This is bad; you can improve on that with an initial hash lookup. (Your version has a second problem: calling AcceptChanges() right after Delete() inside the loop removes the row from the collection mid-iteration, which shifts the remaining row indices, so rows get skipped.)

    var Set1 = new Dictionary<string, int>();
    
    //Prehash all values in the set that won't be deleted from 
    for (int i = 0; i < CollectionIn.Rows.Count; i++)
    {
        string value1 = CollectionIn.Rows[i].ItemArray[0].ToString().ToLower();
        Set1.Add(value1, i); //NOTE: Add throws if CollectionIn itself contains duplicate values
    }
    
    //Loop over the other set
    for (int i2 = 0; i2 < CollectionIn2.Rows.Count; i2++)
    {
        string value2 = CollectionIn2.Rows[i2].ItemArray[0].ToString().ToLower();
    
        int foundIndex;
        if (Set1.TryGetValue(value2, out foundIndex) == false)
            continue;
    
        //Remove value1 when value1 == value2
        CollectionIn.Rows[foundIndex].Delete();
    }
    CollectionIn.AcceptChanges(); //It's probably best to save changes last as a single call
    

    I hashed CollectionIn and then iterated CollectionIn2, which means I needed a Dictionary so that I'd have the CollectionIn row index available for deleting. If this were reversed and CollectionIn2 were hashed instead, a HashSet would suffice, and it would be better because it would also handle internal duplicates within CollectionIn, therefore:

    var Set2 = new HashSet<string>();
    
    //Prehash all values in one set (ideally the larger set)
    for (int i2 = 0; i2 < CollectionIn2.Rows.Count; i2++)
    {
        string value2 = CollectionIn2.Rows[i2].ItemArray[0].ToString().ToLower();
    
        Set2.Add(value2); //HashSet.Add is simply ignored (returns false) for duplicate values
    }
    
    //Loop over the other set
    for (int i1 = 0; i1 < CollectionIn.Rows.Count; i1++)
    {
        string value1 = CollectionIn.Rows[i1].ItemArray[0].ToString().ToLower();
    
        if (Set2.Contains(value1) == false)
            continue;
    
        //Remove the row when its value also appears in CollectionIn2
        CollectionIn.Rows[i1].Delete();
    }
    
    CollectionIn.AcceptChanges(); //It's probably best to save changes last as a single call
    
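    To sanity-check this against the question's sample data ([A,B,C,D] vs. [A,B,C]), here is a minimal standalone harness. The DataTable setup and the "Value" column name are my own scaffolding for illustration; inside Blue Prism the collections would be passed into the code stage for you:

    using System;
    using System.Collections.Generic;
    using System.Data;

    class Demo
    {
        static void Main()
        {
            //Build the two sample collections (column name "Value" is assumed)
            var CollectionIn = new DataTable();
            CollectionIn.Columns.Add("Value", typeof(string));
            foreach (var v in new[] { "A", "B", "C", "D" })
                CollectionIn.Rows.Add(v);

            var CollectionIn2 = new DataTable();
            CollectionIn2.Columns.Add("Value", typeof(string));
            foreach (var v in new[] { "A", "B", "C" })
                CollectionIn2.Rows.Add(v);

            //Prehash CollectionIn2, then delete the matching rows from CollectionIn
            var Set2 = new HashSet<string>();
            for (int i2 = 0; i2 < CollectionIn2.Rows.Count; i2++)
                Set2.Add(CollectionIn2.Rows[i2].ItemArray[0].ToString().ToLower());

            for (int i1 = 0; i1 < CollectionIn.Rows.Count; i1++)
            {
                if (Set2.Contains(CollectionIn.Rows[i1].ItemArray[0].ToString().ToLower()))
                    CollectionIn.Rows[i1].Delete(); //Marks the row; it stays in place until AcceptChanges
            }
            CollectionIn.AcceptChanges();

            foreach (DataRow row in CollectionIn.Rows)
                Console.WriteLine(row[0]); //Prints: D
        }
    }
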

    This pattern would apply to many data set types (including List, array, and more). Of course, if you can write SQL for remote datasets on the same database, that's better.
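
    For instance, here is a sketch of the same anti-join over plain List<string> values (the list names and sample data are just for illustration, assuming System and System.Collections.Generic are imported):

    var listIn  = new List<string> { "A", "B", "C", "D" };
    var listIn2 = new List<string> { "A", "B", "C" };

    //Hash the values to subtract, then remove every match in a single pass
    var toRemove = new HashSet<string>(listIn2, StringComparer.OrdinalIgnoreCase);
    listIn.RemoveAll(toRemove.Contains); //listIn is now ["D"]

    The OrdinalIgnoreCase comparer plays the same role as the ToLower() calls above.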

    If you love lambda functions, it should look something like this:

    var alreadyInSet2 = new HashSet<string>(CollectionIn2.Rows.Cast<DataRow>()
                        .Select(x => x[0].ToString().ToLower()));
    
    CollectionIn.Rows.Cast<DataRow>()
                        .Where(y => alreadyInSet2.Contains(y[0].ToString().ToLower()))
                        .ToList() //ToList() is needed here because ForEach is defined on List<T>
                        .ForEach(y => y.Delete());
    
    CollectionIn.AcceptChanges();                   
    

    Also see: With two very large lists/collections - how to detect and/or remove duplicates efficiently - where more time/work can go into a broader range of answers and performance enhancements.