performance, powershell, csv, compare, processing-efficiency

PowerShell: compare 2 large CSV files to find users that don't exist in one of them


I have 2 CSV files with ~10,000 users each. I need to count how many users appear in csv1 but not in csv2. At the moment I have the code below, but I'm aware this is probably extremely inefficient, since it potentially loops through up to 10,000 users 10,000 times. The code takes forever to run and I'm sure there must be a more efficient way. Any help or suggestions are appreciated; I am fairly new to PowerShell.

foreach ($csv1User in $csv1) {
    $found = $false
    foreach ($csv2User in $csv2) {
        if ($csv1User.identifier -eq $csv2User.identifier) {
            $found = $true
            break
        }
    }
    if ($found -ne $true) {
        $count++
    }
}

Solution

  • If you replace your nested loops with two HashSets, you have two ways of calculating the symmetric difference between the two files:

    Using SymmetricExceptWith()

    The HashSet<T>.SymmetricExceptWith() method allows us to calculate the subset of identifiers that exist in either collection but not in both:

    # Create hashset from one list
    $userIDs = [System.Collections.Generic.HashSet[string]]::new([string[]]$csv1.identifier)
    
    # Pass the other list to `SymmetricExceptWith`
    $userIDs.SymmetricExceptWith([string[]]$csv2.identifier)
    
    # Now we have an efficient filter!
    $relevantRecords = @($csv1;$csv2) |Where-Object { $userIDs.Contains($_.identifier) } |Sort-Object -Unique identifier
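
    Since the original question only asks to count the users that appear in csv1 but not in csv2, a one-directional variant may be a better fit: HashSet<T>.ExceptWith() removes every element found in the other collection, leaving only the csv1-only identifiers. A minimal sketch, assuming the same identifier property as above:

    # Start from the csv1 identifiers
    $csv1Only = [System.Collections.Generic.HashSet[string]]::new([string[]]$csv1.identifier)

    # Remove everything that also appears in csv2
    $csv1Only.ExceptWith([string[]]$csv2.identifier)

    # The remaining size is the count the question asks for
    $count = $csv1Only.Count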
    

    Using sets to track duplicates

    Similarly, we can use hash sets to keep track of which identifiers have been observed at least once, and which have been seen more than once:

    # Create sets for tracking
    $seenOnce = [System.Collections.Generic.HashSet[string]]::new()
    $seenTwice = [System.Collections.Generic.HashSet[string]]::new()
    
    # Loop through whole superset of records
    foreach($record in @($csv1;$csv2)){
      # Always attempt to add to the $seenOnce set
      if(!$seenOnce.Add($record.identifier)){
        # We've already seen this identifier once, add it to $seenTwice
        [void]$seenTwice.Add($record.identifier)
      }
    }
    
    # Just like the previous example, we now have an efficient filter!
    $relevantRecords = @($csv1;$csv2) |Where-Object { $seenOnce.Contains($_.identifier) -and -not $seenTwice.Contains($_.identifier) } |Sort-Object -Unique identifier
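
    With these two sets in place, the count from the original question can be read off directly; a sketch, assuming identifiers are unique within each file (otherwise a record duplicated inside csv1 alone would land in $seenTwice and be excluded):

    # csv1 records whose identifier never appeared in csv2
    $count = @($csv1 | Where-Object { -not $seenTwice.Contains($_.identifier) }).Count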
    

    Using a hash table as a grouping construct

    You could also use a dictionary type (a [hashtable], for example) to group records from both csv files by their identifier, and then filter on the number of records in each dictionary entry:

    # Group records by their identifier value
    $groupsById = @{}
    foreach($record in @($csv1;$csv2)){
      if(-not $groupsById.ContainsKey($record.identifier)){
        $groupsById[$record.identifier] = @()
      }
      $groupsById[$record.identifier] += $record
    }
    
    # Filter based on number of records with a distinct identifier
    $relevantRecords = $groupsById.GetEnumerator() |Where-Object { $_.Value.Count -eq 1 } |Select-Object -Expand Value
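
    For comparison, PowerShell's built-in Group-Object cmdlet expresses the same grouping in a single pipeline; it is usually slower than the explicit hashtable on inputs of this size, but it reads nicely. A sketch under the same assumptions:

    # Group the combined records by identifier and keep the groups with a single record
    $relevantRecords = @($csv1;$csv2) |
        Group-Object -Property identifier |
        Where-Object { $_.Count -eq 1 } |
        ForEach-Object { $_.Group }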