I have 2 csv files with ~10,000 users each. I need to count how many users appear in csv1 and not in csv2. At the moment I have the code below. However I'm aware that this is probably extremely inefficient as it is potentially looping through up to 10,000 users 10,000 times. The code takes forever to run and I'm sure there must be a more efficient way. Any help or suggestions are appreciated I am fairly new to Powershell
foreach ($csv1User in $csv1) {
$found = $false
foreach ($csv2User in $csv2) {
if ($csv1User.identifier -eq $csv2User.identifier)
{
$found = $true
break
}
}
if ($found -ne $true){
$count++
}
}
If you replace your nested loops with 2 HashSet's, you'll have two ways of calculating the exception between the two:
SymmetricExceptWith()
The HashSet<T>.SymmetricExceptWith()
function allows us to calculate the subset of terms that exist in either collection but not in both:
# Create hashset from one list
$userIDs = [System.Collections.Generic.HashSet[string]]::new([string[]]$csv1.identifier)
# Pass the other list to `SymmetricExceptWith`
$userIDs.SymmetricExceptWith([string[]]$csv2.identifier)
# Now we have an efficient filter!
$relevantRecords = @($csv1;$csv2) |Where-Object { $userIDs.Contains($_.identifier) } |Sort-Object -Unique identifier
Similarly we can use hash sets to keep track of which terms that have been observed at least once, and which ones has been seen more than once:
# Create sets for tracking
$seenOnce = [System.Collections.Generic.HashSet[string]]::new()
$seenTwice = [System.Collections.Generic.HashSet[string]]::new()
# Loop through whole superset of records
foreach($record in @($csv1;$csv2)){
# Always attempt to add to the $seenOnce set
if(!$seenOnce.Add($record.identifier)){
# We've already seen this identifier once, add it to $seenTwice
[void]$seenTwice.Add($record.identifier)
}
}
# Just like the previous example, we now have an efficient filter!
$relevantRecords = @($csv1;$csv2) |Where-Object { $seenOnce.Contains($_.identifier) -and -not $seenTwice.Contains($_.identifier) } |Sort-Object -Unique identifier
You could also use a dictionary type (like a [hashtable]
for example) to group records from both csv files based on their identifier, and then filter on number of record values in each dictionary entry:
# Groups records on their identifier value
$groupsById = @{}
foreach($record in @($csv1;$csv2)){
if(-not $groupsById.ContainsKey($record.identifier)){
$groupsById[$record.identifier] = @()
}
$groupsById[$record.identifier] += $record
}
# Filter based on number of records with a distinct identifier
$relevantRecords = $groupsById.GetEnumerator() |Where-Object { $_.Value.Count -eq 1 } |Select-Object -Expand Value