performance, powershell, csv, compare, processing-efficiency

PowerShell: compare 2 large CSV files to find users that don't exist in one of them


I have 2 CSV files with ~10,000 users each. I need to count how many users appear in csv1 but not in csv2. At the moment I have the code below, but I'm aware this is probably extremely inefficient, since it potentially loops through up to 10,000 users 10,000 times. The code takes forever to run and I'm sure there must be a more efficient way. Any help or suggestions are appreciated; I am fairly new to PowerShell.

foreach ($csv1User in $csv1) {
    $found = $false
    foreach ($csv2User in $csv2) {
        if ($csv1User.identifier -eq $csv2User.identifier) {
            $found = $true
            break
        }
    }
    if ($found -ne $true) {
        $count++
    }
}

Solution

  • If you replace your nested loops with two HashSets, you have two ways of calculating the symmetric difference between the two files:

    Using SymmetricExceptWith()

    The HashSet<T>.SymmetricExceptWith() method allows us to calculate the subset of identifiers that exist in either collection but not in both:

    # Create hashset from one list
    $userIDs = [System.Collections.Generic.HashSet[string]]::new([string[]]$csv1.identifier)
    
    # Pass the other list to `SymmetricExceptWith`
    $userIDs.SymmetricExceptWith([string[]]$csv2.identifier)
    
    # Now we have an efficient filter!
    $relevantRecords = @($csv1;$csv2) |Where-Object { $userIDs.Contains($_.identifier) } |Sort-Object -Unique identifier
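
    Since the original question only asks to count the users that appear in csv1 but not in csv2, a one-directional variant may be a better fit: HashSet<T>.ExceptWith() removes every element found in the other collection, leaving only the csv1-only identifiers. A minimal sketch, assuming the same identifier property as above:

    # Start from the csv1 identifiers
    $csv1Only = [System.Collections.Generic.HashSet[string]]::new([string[]]$csv1.identifier)

    # Remove everything that also appears in csv2
    $csv1Only.ExceptWith([string[]]$csv2.identifier)

    # The remaining size is the count the question asks for
    $count = $csv1Only.Count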
    

    Using sets to track duplicates

    Similarly, we can use hash sets to keep track of which identifiers have been observed at least once, and which have been seen more than once:

    # Create sets for tracking
    $seenOnce = [System.Collections.Generic.HashSet[string]]::new()
    $seenTwice = [System.Collections.Generic.HashSet[string]]::new()
    
    # Loop through whole superset of records
    foreach($record in @($csv1;$csv2)){
      # Always attempt to add to the $seenOnce set
      if(!$seenOnce.Add($record.identifier)){
        # We've already seen this identifier once, add it to $seenTwice
        [void]$seenTwice.Add($record.identifier)
      }
    }
    
    # Just like the previous example, we now have an efficient filter!
    $relevantRecords = @($csv1;$csv2) |Where-Object { $seenOnce.Contains($_.identifier) -and -not $seenTwice.Contains($_.identifier) } |Sort-Object -Unique identifier
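
    With these two sets in place, the count from the original question can be read off directly; a sketch, assuming identifiers are unique within each file (otherwise a record duplicated inside csv1 alone would land in $seenTwice and be excluded):

    # csv1 records whose identifier never appeared in csv2
    $count = @($csv1 | Where-Object { -not $seenTwice.Contains($_.identifier) }).Count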
    

    Using a hash table as a grouping construct

    You could also use a dictionary type (a [hashtable], for example) to group records from both csv files by their identifier, and then filter on the number of records in each dictionary entry:

    # Group records by their identifier value
    $groupsById = @{}
    foreach($record in @($csv1;$csv2)){
      if(-not $groupsById.ContainsKey($record.identifier)){
        $groupsById[$record.identifier] = @()
      }
      $groupsById[$record.identifier] += $record
    }
    
    # Filter based on number of records with a distinct identifier
    $relevantRecords = $groupsById.GetEnumerator() |Where-Object { $_.Value.Count -eq 1 } |Select-Object -Expand Value
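
    For comparison, PowerShell's built-in Group-Object cmdlet expresses the same grouping in a single pipeline; it is usually slower than the explicit hashtable on inputs of this size, but it reads nicely. A sketch under the same assumptions:

    # Group the combined records by identifier and keep the groups with a single record
    $relevantRecords = @($csv1;$csv2) |
        Group-Object -Property identifier |
        Where-Object { $_.Count -eq 1 } |
        ForEach-Object { $_.Group }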