We have a large (~100 MB) text file and need to remove any lines that contain certain phrases. I would like to use PowerShell to replace the current method, which is a .bat file that uses Windows grep.
The problem is that there are about 95 key phrases, and any line containing any of them must be removed.
The list of key phrases is stored in "badPhrases.txt", one phrase per line. With that many phrases I don't want to hard-code the list, but I will if I have to.
I have tried a few comparisons, but my output is always LARGER than my original input file, or 0 KB (empty). What am I doing wrong? I suspect the problem is in the Where-Object filter, but I could be wrong.
[string[]]$arrayFromFile = Get-Content -Path '.\badPhrases.txt'
get-content ".\inputfile.txt" | Where-Object {$_ -notlike $arrayFromFile} | Out-File ".\clean_data.txt" -Force
I've tried -notlike, -notin, -notmatch and -notcontains (while flipping the array and the input object around in ways that seemed logical), such as:
Where-Object {$arrayFromFile -notin $_}
....
Where-Object {$_ -notcontains $arrayFromFile}
....
Where-Object {$_ -notlike arrayFromFile}
I have searched Stack Overflow and googled around, but I can't find any live links that address this exact use case. There was a "Hey, Scripting Guy!" reference, but that link was dead too.
Use Select-String, which supports multiple search criteria via an array of strings passed to its -Pattern parameter:
Select-String -NotMatch -SimpleMatch -Pattern (Get-Content -Path .\badPhrases.txt) .\inputfile.txt |
Select-Object -ExpandProperty Line |
Out-File .\clean_data.txt -Force
Character-encoding caveat: In Windows PowerShell, Out-File creates "Unicode" (UTF-16LE) files by default, where each character is represented by (at least) 2 bytes; in PowerShell [Core] 6+, the default is more sensibly BOM-less UTF-8; use the -Encoding parameter to control the character encoding explicitly.
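To illustrate the encoding caveat, here is a minimal, self-contained sketch of the same pipeline with the encoding set explicitly. The temporary folder and the sample file contents are hypothetical stand-ins for the real files; only the -Encoding utf8 argument is new relative to the command above.

```powershell
# Hypothetical sample files in a temporary folder, standing in for the real ones.
$dir = Join-Path ([IO.Path]::GetTempPath()) ([guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
Set-Content -Path (Join-Path $dir 'badPhrases.txt') -Value 'bad phrase', 'worse phrase'
Set-Content -Path (Join-Path $dir 'inputfile.txt') -Value 'a good line', 'has a bad phrase in it', 'another good line'

# Same pipeline as above, but with the output encoding requested explicitly.
Select-String -NotMatch -SimpleMatch -Pattern (Get-Content -Path (Join-Path $dir 'badPhrases.txt')) -Path (Join-Path $dir 'inputfile.txt') |
  Select-Object -ExpandProperty Line |
  Out-File (Join-Path $dir 'clean_data.txt') -Force -Encoding utf8

# Show the surviving lines.
Get-Content (Join-Path $dir 'clean_data.txt')
```

Note that -Encoding utf8 writes a BOM in Windows PowerShell, whereas in PowerShell 6+ it produces BOM-less UTF-8.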
-NotMatch negates the matching, so that only lines not matching any of the pattern strings are output.
-SimpleMatch ensures that the patterns are matched literally against the lines of the input file; by default, they're interpreted as regular expressions.
Note that matching is case-insensitive by default; use -CaseSensitive, if needed.
Since Select-String outputs Microsoft.PowerShell.Commands.MatchInfo instances by default, Select-Object -ExpandProperty Line is needed to extract the lines themselves. In PowerShell 7+, you can use Select-String's -Raw switch instead.
As for what you tried:
$_ -notlike $arrayFromFile
You cannot use an array as the RHS of string-comparison operators such as -like, -match, -eq - you can only match against one string at a time.
(Apart from that, -like / -notlike match against the entire LHS by default; to match a substring of the LHS, you'd have to put * on either end of the RHS.)
See this answer for more information.
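The following sketch illustrates the two points above and shows one possible way to keep a Where-Object-based filter: combine all phrases into a single regex so that -notmatch only ever sees one pattern string. This is an alternative to Select-String, not something the original attempt contained; the sample arrays are hypothetical stand-ins for the contents of badPhrases.txt and inputfile.txt.

```powershell
# -like matches the whole LHS unless the RHS carries wildcards.
'a line with bad phrase' -like 'bad phrase'     # False
'a line with bad phrase' -like '*bad phrase*'   # True

# Hypothetical sample data standing in for the two files.
$badPhrases = 'foo bar', 'a.b'
$lines = 'keep this line', 'drop: foo bar here', 'drop: a.b', 'keep: axb'

# Escape each phrase so regex metacharacters are matched literally,
# skip empty entries (an empty alternative would match every line),
# then join everything into ONE pattern with '|'.
$regex = ($badPhrases | Where-Object { $_ } | ForEach-Object { [regex]::Escape($_) }) -join '|'

# Keep only the lines that match none of the phrases.
$kept = $lines | Where-Object { $_ -notmatch $regex }
$kept
```

With the sample data, only 'keep this line' and 'keep: axb' survive; the latter shows that the escaped '.' in 'a.b' is matched literally.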
$arrayFromFile -notin $_
$_ -notcontains $arrayFromFile
In principle, you'd have to reverse the operands for the containment operators -in and -contains and their negations - the syntax is <array> -contains <value> and <value> -in <array> - but the problem is that, again, entire strings are matched either way, so this approach would only work if $arrayFromFile contained full lines present in the input (-in and -contains implicitly perform per-element -eq comparisons).
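A short sketch of that whole-element behavior, using a hypothetical two-phrase array in place of $arrayFromFile:

```powershell
# Hypothetical stand-in for the phrases read from badPhrases.txt.
$phrases = 'bad phrase', 'worse phrase'

# Containment operators compare each array element with -eq, i.e. in full.
$exact   = $phrases -contains 'bad phrase'              # True: equals an element
$partial = $phrases -contains 'a line with bad phrase'  # False: substring only
$viaIn   = 'bad phrase' -in $phrases                    # True: same test, operands swapped

$exact, $partial, $viaIn
```

So a line that merely contains a phrase is never reported as contained in the array, which is why these operators cannot filter on substrings.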