regex powershell text-manipulation

PowerShell: Delete similar lines from file


Consider the file tbl.txt (1.5 million lines), built like:

Num1 ; Num2 ; 'Value' ; 'Attribute'

So tbl.txt looks like:

  63 ; 193 ; 'Green'  ; 'Color'
 152 ; 162 ; 'Tall'   ; 'Size'
 230 ; 164 ; '130lbs' ; 'Weight'
 249 ; 175 ; 'Green'  ; 'Color'      *duplicate on 'Value' and 'Attribute'*
 420 ; 178 ; '8'      ; 'Shoesize'
 438 ; 172 ; 'Tall'   ; 'Size'       *duplicate on 'Value' and 'Attribute'*

How can I keep the first line for each unique combination of 'Value' and 'Attribute' and delete all following lines that duplicate that combination?

The result should look like:

  63 ; 193 ; 'Green'  ; 'Color'
 152 ; 162 ; 'Tall'   ; 'Size'
 230 ; 164 ; '130lbs' ; 'Weight'
 420 ; 178 ; '8'      ; 'Shoesize'

Any help is much appreciated.


Solution

  • Loop over the text file with Get-Content, extract the 'Value' ; 'Attribute' columns with string operations, and use a hashtable to check whether a line with the same columns has already been processed. If not, output the line and record its key. In code:

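    $map = @{}
    Get-Content tbl.txt | ForEach-Object {
        # Key = everything after the second ';', i.e. the 'Value' ; 'Attribute' part
        $key = $_.Substring($_.IndexOf(';', $_.IndexOf(';') + 1) + 1)
        # Emit the line only the first time its key is seen
        if (-not $map.ContainsKey($key)) { $_; $map[$key] = 1 }
    }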
    

    Alternatively, as mentioned in the comments, you can use Group-Object (alias group) with the same substring as the grouping criterion, and then take the first element of each group:

    Get-Content tbl.txt |
        Group-Object { $_.Substring($_.IndexOf(';', $_.IndexOf(';') + 1) + 1) } |
        ForEach-Object { $_.Group[0] }
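
Both snippets above compare the raw text after the second ';', so two lines only count as duplicates if their padding is identical. As a minimal sketch of a variant that ignores padding, the function below splits each line on ';', trims the third and fourth fields, and uses a HashSet to remember the keys already seen. The function name Remove-DuplicateValueAttribute and the output file name tbl.dedup.txt are hypothetical, chosen only for illustration. It also assumes that no 'Value' or 'Attribute' contains a ';' itself.

```powershell
# Hypothetical helper, not part of the original answer.
# Assumes four ';'-separated fields per line and no ';' inside a quoted value.
function Remove-DuplicateValueAttribute {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [string] $Line
    )
    begin {
        # Set of 'Value';'Attribute' keys that have already been emitted
        $seen = New-Object 'System.Collections.Generic.HashSet[string]'
    }
    process {
        $fields = $Line.Split(';')
        if ($fields.Count -lt 4) {
            # Lines without four fields are passed through unchanged (an assumption)
            $Line
            return
        }
        # Trim padding so differently aligned duplicates share the same key
        $key = $fields[2].Trim() + ';' + $fields[3].Trim()
        # HashSet.Add returns $true only when the key was not present yet
        if ($seen.Add($key)) { $Line }
    }
}
```

A possible call, streaming the input and writing the kept lines to a new file:

```powershell
Get-Content tbl.txt | Remove-DuplicateValueAttribute | Set-Content tbl.dedup.txt
```

Like the hashtable version, this emits each line as soon as it is read and only keeps the distinct keys in memory, which may matter for a file with 1.5 million lines.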