Tags: arrays, regex, powershell, data-processing

PowerShell: remove any lines from a big text file that contain any of a large number of strings


We have a large (~100MB) text file, and we need to remove any lines that contain certain phrases. The current method is a .bat file that uses Windows grep; I would like to replace it with PowerShell.

The problem is, there are about 95 key phrases. Any line containing any of these phrases must be removed.

The list of key phrases is contained in "badPhrases.txt", one phrase per line, like a regular text file. With roughly 100 of them, I don't want to include them in a hard-coded list, but I will if I have to.

I have tried a few comparisons, but my output is always LARGER than my original input file, or it is 0 KB (empty). What am I doing wrong? I suspect the problem is in the Where-Object filter, but I could be wrong.

[string[]]$arrayFromFile = Get-Content -Path '.\badPhrases.txt'
get-content ".\inputfile.txt" | Where-Object {$_ -notlike $arrayFromFile} | Out-File ".\clean_data.txt" -Force

I've tried -notlike, -notin, -notmatch and -notcontains (while flipping the array and the input object around in ways that seemed logical). Such as...

Where-Object {$arrayFromFile -notin $_}
....
Where-Object {$_ -notcontains $arrayFromFile}
....
Where-Object {$_ -notlike $arrayFromFile}

I have searched Stack Overflow and googled around, and I'm not able to find any live links that address this exact use case. There was a "Hey, Scripting Guy!" reference, but the link was dead.


Solution

  • Use Select-String, which supports multiple search criteria via an array of strings passed to its
    -Pattern parameter:

    Select-String -NotMatch -SimpleMatch -Pattern (Get-Content -Path .\badPhrases.txt) .\inputfile.txt |
      Select-Object -ExpandProperty Line |
      Out-File .\clean_data.txt -Force
    

    Character-encoding caveat: In Windows PowerShell, Out-File creates "Unicode" (UTF-16LE) files by default, where each character is represented by (at least) 2 bytes; in PowerShell [Core] 6+, the default is more sensibly BOM-less UTF-8; use the -Encoding parameter to control the character encoding explicitly.

    • -NotMatch negates the matching, so that only lines not matching any of the pattern strings are output.

    • -SimpleMatch ensures that the patterns are matched literally against the lines of the input file; by default, they're interpreted as regular expressions.

    • Note that matching is case-insensitive by default; use -CaseSensitive, if needed.

    • Since Select-String outputs Microsoft.PowerShell.Commands.MatchInfo instances by default, Select-Object -ExpandProperty Line is needed to extract the lines themselves.

      • Note: In PowerShell 7+, you can use Select-String's -Raw switch instead.
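
To see how these switches interact, here is a small self-contained sketch that can be run as-is. The temporary directory, the sample lines and the two sample phrases are made up for illustration; only the Select-String parameters come from the answer above. It passes -Encoding utf8 to Out-File explicitly to sidestep the UTF-16LE default mentioned in the caveat (in Windows PowerShell, utf8 means UTF-8 with a BOM).

```powershell
# Throwaway files in a temp directory; all names and sample data here are illustrative.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
$inputFile  = Join-Path $dir 'inputfile.txt'
$phraseFile = Join-Path $dir 'badPhrases.txt'
$outputFile = Join-Path $dir 'clean_data.txt'

Set-Content -Path $inputFile  -Value 'keep this line', 'drop: error [42]', 'also keep', 'DROP: Timeout here'
Set-Content -Path $phraseFile -Value 'error [42]', 'timeout'

# -SimpleMatch treats 'error [42]' literally (as a regex, [42] would be a character class).
# Matching is case-insensitive by default, so 'timeout' also removes the 'Timeout' line.
Select-String -Path $inputFile -Pattern (Get-Content -Path $phraseFile) -SimpleMatch -NotMatch |
  Select-Object -ExpandProperty Line |
  Out-File -FilePath $outputFile -Encoding utf8 -Force

# Read the result back before cleaning up.
$result = @(Get-Content -Path $outputFile)
$result

Remove-Item -Path $dir -Recurse -Force
```

With the sample data, only 'keep this line' and 'also keep' should survive.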

    As for what you tried:

    $_ -notlike $arrayFromFile

    You cannot use an array as the RHS of string-comparison operators such as -like, -match, -eq - you can only match against one string at a time.

    (Apart from that, -like / -notlike match against the entire LHS by default; to match a substring of the LHS, you'd have to put * on either end of the RHS.)

    See this answer for more information.
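
If you would rather stay with Where-Object, one possible workaround (not part of the Select-String solution above, just a sketch) is to collapse the phrase list into a single regular expression, so that the RHS is one string again. The sample phrases and lines below are invented for illustration; in practice the phrases would come from Get-Content .\badPhrases.txt.

```powershell
# Illustrative data; the empty entry simulates a blank line in the phrase file.
$badPhrases = 'error [42]', '', 'timeout'
$lines      = 'keep this line', 'drop: error [42]', 'also keep', 'DROP: Timeout here'

# Skip empty entries (an empty alternative would match every line),
# escape each phrase so it is matched literally, then join with | (regex alternation).
$pattern = ($badPhrases |
  Where-Object { $_ } |
  ForEach-Object { [regex]::Escape($_) }) -join '|'

# -notmatch performs substring matching and is case-insensitive by default.
$kept = @($lines | Where-Object { $_ -notmatch $pattern })
$kept
```

Unlike -notlike, -notmatch does not need wildcards to match a substring, which is why a single alternation pattern is enough here.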

    $arrayFromFile -notin $_

    $_ -notcontains $arrayFromFile

    In principle, you'd have to reverse the operands for containment operators -in and -contains and their negations - the syntax is <array> -contains <value> and <value> -in <array> - but the problem is that, again, matching of the entire strings is performed either way, so this approach would only work if $arrayFromFile contained full lines present in the input (-in and -contains implicitly perform per-element -eq comparisons).
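
A quick way to convince yourself of the whole-element behavior is the following sketch; the sample values are made up for illustration.

```powershell
$phrases = 'apple', 'pear'

$wholeMatch   = 'apple' -in $phrases            # an element equals 'apple' exactly
$substring    = 'an apple a day' -in $phrases   # no element equals the whole line
$containsForm = $phrases -contains 'apple'      # same test as the first, operands swapped

$wholeMatch, $substring, $containsForm
```

The second test is the one that corresponds to the use case in the question: a line that merely contains a phrase is not equal to any element, so the containment operators never flag it.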