
How do I output only the lines from a string array whose keyword matches in two or more lines?


Using Compare-Object assigned to a variable:

    $diff = Compare-Object -referenceobject (Get-Content $hashlog1 -Encoding UTF8 | select -skip 1) -differenceobject (Get-Content $hashlog2 -Encoding utf8 | select -skip 1) | 
        Select @{N='Hash'; E={($_.InputObject -split ' ',2)[0] }}, @{N='File'; E={((($_.InputObject -split ' ',2)[1]) -split ' ' | select -skiplast 2) -join ' ' }}, @{N='SizeDate'; E={($_.InputObject -split ' ' | select -last 2) -join ' ' }}, SideIndicator | 
        Format-Table -AutoSize | Out-String -Width 4096

Here is a sample log file (call it hashlog.txt) as produced by the Compare-Object pipeline above:

Hash                             File                                            SizeDate               SideIndicator
----                             ----                                            --------               -------------
D41D8CD98F00B204E9800998ECF8427E \added.txt                                      0 20230107_021401      =>           
6B7CA4894B3CFCBA1ECA6B8BB9656FE8 \dirsize2.bat                                   714 20231010_111350    =>           
804AB051DB174BF5FF53911647A094C2 \SSDTest ™\logfile.txt                          122644 20220806_221741 =>           
9266C28971E3624B28DAB39ADEE0694E \SSDTest ™\logfilemixed.txt                     14627 20220807_115714  =>           
4298F8C3383A93D121A1A91764492F93 \SSDTest ™\newfile.rtf                          42098 20231010_111523  =>           
1233EEF8C71A6AF8D23068CCDF1E639D \SSDTest ™\Samsung 850 EVO Small Files Only.txt 205671 20220805_224013 =>           
BD30A4E10CA22E3E3C6BA7063ACEBF0D \SSDTest ™\SSDCopy.bat                          269 20220731_172008    =>           
AF32AC8BEFDF8D3B63DE5D7B834709FD \SSDTest ™\SSDCopysmall.ps1                     1670 20220806_221654   =>           
AA7A12226FDD40671C191FAF6AA57733 \SSDTest ™\SSDCopy_Write_Read.bat               462 20220805_023811    =>           
3EDC4CE7B65FBC140B7D0F5604F3070D \SSDTest ™\SSDMixedWrite.ps1                    2476 20220807_115705   =>           
768B150D59F7EA576F375430883CC8DA \SSDTest ™\SSDRead.ps1                          2044 20220805_021614   =>           
B7D8DA3D9C1387EBCEC0565176385BB7 \SMR FINAL\SMR_PC_TEST_1_FILLRND.log            13014 20220823_134552  <=           
D41D8CD98F00B204E9800998ECF8427E \added.txt                                      0 20230107_021409      <=           
E17C51851C46801F9399087B61736E34 \dirsize2.bat                                   711 20230107_000101    <=           
E7A73A93D06EB266E614FC5110BBBF28 \SMR FINAL\SMR_PC_TEST_1_FILLRND.log            13014 20220822_133356  =>           
804AB051DB174BF5FF53911647A094C2 \SSDTest\logfile.txt                            122644 20220806_221741 <=           
9266C28971E3624B28DAB39ADEE0694E \SSDTest\logfilemixed.txt                       14627 20220807_115714  <=           
1233EEF8C71A6AF8D23068CCDF1E639D \SSDTest\Samsung 850 EVO Small Files Only.txt   205671 20220805_224013 <=           
AF32AC8BEFDF8D3B63DE5D7B834709FD \SSDTest\SSDCopysmall.ps1                       1670 20220806_221654   <=           
AA7A12226FDD40671C191FAF6AA57733 \SSDTest\SSDCopy_Write_Read.bat                 462 20220805_023811    <=           
3EDC4CE7B65FBC140B7D0F5604F3070D \SSDTest\SSDMixedWrite.ps1                      2476 20220807_115705   <=           
768B150D59F7EA576F375430883CC8DA \SSDTest\SSDRead.ps1                            2044 20220805_021614   <=

How can I output only the entries whose File value occurs two or more times? For example, the above log would result in the following:

D41D8CD98F00B204E9800998ECF8427E \added.txt                                      0 20230107_021401      =>           
D41D8CD98F00B204E9800998ECF8427E \added.txt                                      0 20230107_021409      <=           

6B7CA4894B3CFCBA1ECA6B8BB9656FE8 \dirsize2.bat                                   714 20231010_111350    =>           
E17C51851C46801F9399087B61736E34 \dirsize2.bat                                   711 20230107_000101    <=           

B7D8DA3D9C1387EBCEC0565176385BB7 \SMR FINAL\SMR_PC_TEST_1_FILLRND.log            13014 20220823_134552  <=           
E7A73A93D06EB266E614FC5110BBBF28 \SMR FINAL\SMR_PC_TEST_1_FILLRND.log            13014 20220822_133356  =>           

I did attempt a nested ForEach-Object, but that output only a single line per matched file instead of listing both.

It seems like this should be straightforward, but it isn't. Thanks in advance for any assistance.

(If you're wondering about the "Trademark" Symbol ™ I was just making sure special characters were passing through)


Solution

  • To achieve your goal, you need to operate on objects with distinct properties rather than on strings, and the end of your pipeline prevents that:

    …| Format-Table -AutoSize | Out-String -Width 4096
    
    • Format-* cmdlets emit output objects whose sole purpose is to provide formatting instructions to PowerShell's for-display output-formatting system. In short: only ever use Format-* cmdlets to format data for display, never for subsequent programmatic processing. See this answer for more information.

    • By additionally using Out-String, you end up with a single, multiline string that represents the original data in a format for human, not programmatic consumption.
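
    The two points above can be checked with a minimal sketch. The two objects in $sample are hypothetical stand-ins for the parsed Compare-Object output, not the asker's real data:

```powershell
# Hypothetical sample objects standing in for the parsed Compare-Object output.
$sample = @(
  [pscustomobject]@{ Hash = 'D41D8CD98F00B204E9800998ECF8427E'; File = '\added.txt'; SideIndicator = '=>' }
  [pscustomobject]@{ Hash = 'D41D8CD98F00B204E9800998ECF8427E'; File = '\added.txt'; SideIndicator = '<=' }
)

# Format-Table emits formatting-instruction objects, not the original data,
# so the .File property is no longer available downstream.
$formatted = $sample | Format-Table
$formattedTypes = $formatted | ForEach-Object { $_.GetType().FullName } | Select-Object -Unique
$formattedTypes

# Adding Out-String collapses everything into one multiline string.
$text = $sample | Format-Table | Out-String
$text.GetType().Name   # String

# The unformatted objects can still be queried by property.
$sample.File | Select-Object -Unique   # \added.txt
```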

    Remove the Format-Table and Out-String calls from your pipeline, so that $diff holds an array of objects, and then pipe $diff (or the modified pipeline directly) to Group-Object as follows:

    $groups = 
      $diff | Group-Object File | Where-Object Count -ge 2
    

    This groups all objects by shared .File property values and uses Where-Object to filter them so as to output only those groups with 2 or more elements.

    Each group is an instance of type Microsoft.PowerShell.Commands.GroupInfo, whose .Group property is a collection of all the elements in the group (.Name holds the shared grouping value and .Count the number of elements).

    You can then use $groups to further process the groups programmatically; read on for how to display the resulting groups in a human-friendly fashion.
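
    As one possible illustration of such programmatic processing, the following sketch reports, per duplicated file, whether the hashes on both sides agree. The $diff array is rebuilt here from three rows of the sample log so that the snippet runs on its own, and the SameHash property name is made up for this example:

```powershell
# Stand-in for $diff: three rows taken from the sample log above.
$diff = @(
  [pscustomobject]@{ Hash = 'D41D8CD98F00B204E9800998ECF8427E'; File = '\added.txt';          SizeDate = '0 20230107_021401';    SideIndicator = '=>' }
  [pscustomobject]@{ Hash = '6B7CA4894B3CFCBA1ECA6B8BB9656FE8'; File = '\dirsize2.bat';       SizeDate = '714 20231010_111350';  SideIndicator = '=>' }
  [pscustomobject]@{ Hash = '768B150D59F7EA576F375430883CC8DA'; File = '\SSDTest\SSDRead.ps1'; SizeDate = '2044 20220805_021614'; SideIndicator = '<=' }
  [pscustomobject]@{ Hash = 'D41D8CD98F00B204E9800998ECF8427E'; File = '\added.txt';          SizeDate = '0 20230107_021409';    SideIndicator = '<=' }
  [pscustomobject]@{ Hash = 'E17C51851C46801F9399087B61736E34'; File = '\dirsize2.bat';       SizeDate = '711 20230107_000101';  SideIndicator = '<=' }
)

# Keep only the files that occur two or more times.
$groups = $diff | Group-Object File | Where-Object Count -ge 2

# For each group: .Name is the shared File value, .Group the member objects.
$report = foreach ($g in $groups) {
  $hashes = @($g.Group.Hash | Select-Object -Unique)
  [pscustomobject]@{
    File     = $g.Name
    Count    = $g.Count
    SameHash = $hashes.Count -eq 1   # hypothetical property: do both sides agree?
  }
}
$report
```

    With this sample, \added.txt is reported with matching hashes and \dirsize2.bat with differing ones, while \SSDTest\SSDRead.ps1 is filtered out because it occurs only once.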


    To display the resulting groups and their elements group by group, Format-Table is appropriate:

    $diff | Group-Object File -OutVariable groups | Where-Object Count -ge 2 |
      ForEach-Object Group | Format-Table -GroupBy File
    

    Note:

    • ForEach-Object Group (parameter -MemberName is implied) in essence undoes the grouping (it outputs the elements of each group one by one), only for the -GroupBy argument of Format-Table to group them again for display (there's no way around that, as Format-Table -GroupBy itself lacks any filtering capability).

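    • The common -OutVariable parameter is used to store the group objects in the self-chosen variable $groups (note how the name is specified without the $), so that the groups can be programmatically processed later. Because -OutVariable is attached to Group-Object here, $groups receives all groups, including single-element ones; attach it to the Where-Object call instead if you want only the filtered groups.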

    The above results in for-display output that:

    • prints a header between groups showing the shared .File property value for each group

    • prints a header for each group, namely the names of the properties being displayed and a separator line.
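
    To try the display pipeline in isolation, here is a sketch with a small stand-in for $diff (three rows copied from the sample log). The intermediate variable $shown is introduced only for this illustration; it also makes it easy to see what -OutVariable captured:

```powershell
# Stand-in for $diff: two rows sharing a File value plus one unique row.
$diff = @(
  [pscustomobject]@{ Hash = '6B7CA4894B3CFCBA1ECA6B8BB9656FE8'; File = '\dirsize2.bat';        SizeDate = '714 20231010_111350';  SideIndicator = '=>' }
  [pscustomobject]@{ Hash = 'E17C51851C46801F9399087B61736E34'; File = '\dirsize2.bat';        SizeDate = '711 20230107_000101';  SideIndicator = '<=' }
  [pscustomobject]@{ Hash = '768B150D59F7EA576F375430883CC8DA'; File = '\SSDTest\SSDRead.ps1'; SizeDate = '2044 20220805_021614'; SideIndicator = '<=' }
)

# ForEach-Object Group unwraps each surviving group into its member objects.
$shown = $diff | Group-Object File -OutVariable groups |
  Where-Object Count -ge 2 |
  ForEach-Object Group

# Display only: Format-Table regroups the members by File for the header lines.
$shown | Format-Table -GroupBy File

@($shown).Count   # 2: only the members of the \dirsize2.bat group
$groups.Count     # 2: -OutVariable on Group-Object captured both groups, unfiltered
```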

    To tweak this format:

    • Add -HideTableHeaders to omit the for-each-group headers.

    • If you also want to omit the between-groups headers, more work is needed:

      $diff | Group-Object File -OutVariable groups | Where-Object Count -ge 2 |
        ForEach-Object {
          ($_.Group | Format-Table -HideTableHeaders | Out-String).Trim()
          '' # Output an empty line between groups.
        }
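
    One caveat with the last command: each group is formatted by its own Format-Table call, so column widths are computed per group and may not line up across groups. If aligned plain-text lines are wanted (for a log file, say), one alternative is to build the lines yourself. This is only a sketch under assumptions: the $diff stand-in, the variable names and the padding approach are illustrative, not part of the original answer:

```powershell
# Stand-in for $diff: rows copied from the sample log above.
$diff = @(
  [pscustomobject]@{ Hash = 'D41D8CD98F00B204E9800998ECF8427E'; File = '\added.txt';           SizeDate = '0 20230107_021401';    SideIndicator = '=>' }
  [pscustomobject]@{ Hash = '6B7CA4894B3CFCBA1ECA6B8BB9656FE8'; File = '\dirsize2.bat';        SizeDate = '714 20231010_111350';  SideIndicator = '=>' }
  [pscustomobject]@{ Hash = '768B150D59F7EA576F375430883CC8DA'; File = '\SSDTest\SSDRead.ps1'; SizeDate = '2044 20220805_021614'; SideIndicator = '<=' }
  [pscustomobject]@{ Hash = 'D41D8CD98F00B204E9800998ECF8427E'; File = '\added.txt';           SizeDate = '0 20230107_021409';    SideIndicator = '<=' }
  [pscustomobject]@{ Hash = 'E17C51851C46801F9399087B61736E34'; File = '\dirsize2.bat';        SizeDate = '711 20230107_000101';  SideIndicator = '<=' }
)

$dupes   = $diff | Group-Object File | Where-Object Count -ge 2
$members = $dupes | ForEach-Object Group

# Column widths are computed once, across all groups, so every line aligns.
$wFile = [int]($members.File     | Measure-Object -Property Length -Maximum).Maximum
$wSize = [int]($members.SizeDate | Measure-Object -Property Length -Maximum).Maximum

$lines = foreach ($g in $dupes) {
  foreach ($m in $g.Group) {
    '{0} {1} {2} {3}' -f $m.Hash, $m.File.PadRight($wFile), $m.SizeDate.PadRight($wSize), $m.SideIndicator
  }
  ''   # Empty line between groups.
}
$lines
```

    The resulting $lines array holds plain strings, so it can be passed to Set-Content or Out-File, while $dupes remains available for further programmatic processing.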