Search code examples
powershellcsvsortingduplicates

Remove duplicates from a CSV file based on two columns


I have a CSV (in $inputCsv) with three columns from which I would like to remove duplicates from SubName while keeping the minimum Rank category (minimum = TEST1 and maximum = TEST3)

Name     SubName     Rank
-----    --------    -----
NAME1    SUBNAME1    TEST1
NAME2    SUBNAME2    TEST1
NAME2    SUBNAME2    TEST3
NAME3    SUBNAME3    TEST2
NAME4    SUBNAME4    TEST3
NAME4    SUBNAME4    TEST2

Desired result:

Name     SubName     Rank
-----    --------    -----
NAME1    SUBNAME1    TEST1
NAME2    SUBNAME2    TEST1
NAME3    SUBNAME3    TEST2
NAME4    SUBNAME4    TEST2

Of course TEST3 can be kept in case it is the only value available.

For ease, I transform Rank into a numeral column so that I can order it. I tried several Sort-Object assemblies without success with, for example:

$inputCsv | Select-Object "Name", "SubName", "Rank", @{Name = 'Weight'; Expression = { 
    if ( $_."Rank" -eq "TEST1" ) { 1 } 
    elseif ( $_."Rank" -eq "TEST2" ) { 2 } 
    elseif ( $_."Rank" -eq "TEST3" ) { 3 } }
} | Sort-Object "SubName",{$_.Weight -lt 3} -Unique

Solution

  • Use Group-Object first grouping by the SubName property then for each group use Sort-Object sorting by Rank (the expression $_.Rank -replace '\D' -as [int] is removing any non-numeric digit and then converting the resulting string into an integer for proper sorting) lastly use Select-Object to take only 1 object and skip the rest:

    $inputCsv | Group-Object SubName | ForEach-Object {
        $_.Group |
            Sort-Object { $_.Rank -replace '\D' -as [int] } |
            Select-Object -First 1
    }
    

    NOTE: If the values in the Rank property will always be a single digit you could just use Sort-Object Rank.