Tags: powershell, powershell-4.0

Filtering sections of data, including the starting and ending lines (PowerShell)


I have a text file that looks like this:

Data I'm NOT looking for  
More data that doesn't matter  
Even more data that I don't

&Start/Finally the data I'm looking for  
&Data/More data that I need  
&Stop/I need this too  

&Start/Second batch of data I need  
&Data/I need this too 
&Stop/Okay now I'm done  
Ending that I don't need  

Here is what the output needs to be:

File1.txt

&Start/Finally the data I'm looking for  
&Data/More data that I need   
&Stop/I need this too  

File2.txt

&Start/Second batch of data I need  
&Data/I need this too 
&Stop/Okay now I'm done  

I need to do this for every file in a folder (sometimes multiple files will need to be filtered). The output file names can simply increment, e.g. File1.txt, File2.txt, File3.txt.

This is what I have tried with no luck:

ForEach-Object{
$text -join "`n" -split '(?ms)(?=^&START)' -match '^&START' | 
Out-File B:\PowerShell\$filename}

Thanks!


Solution

  • Looks like you were pretty close: your code correctly extracted the paragraphs of interest, but it did not filter out the lines inside each paragraph that don't start with &, and it needed to write each paragraph to its own output file:

    $text -join "`n" -split '(?m)(?=^&Start)' -match '^&Start' | 
      ForEach-Object { $ndx=0 } { $_ -split '\n' -match '^&' | Out-File "File$((++$ndx)).txt" }
    

    This creates sequentially numbered files starting with File1.txt for every paragraph of interest.
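
To make the two stages of that pipeline easier to follow, here is a minimal sketch that runs them on an in-memory copy of the sample data and collects the results in a dictionary instead of writing files. The `$blocks` and `$outputs` variables are illustrative names, not part of the original command, and the sample text omits the trailing spaces from the question.

```powershell
# In-memory stand-in for the question's file (an assumption for illustration).
$text = @(
    "Data I'm NOT looking for"
    "More data that doesn't matter"
    ""
    "&Start/Finally the data I'm looking for"
    "&Data/More data that I need"
    "&Stop/I need this too"
    ""
    "&Start/Second batch of data I need"
    "&Data/I need this too"
    "&Stop/Okay now I'm done"
    "Ending that I don't need"
) -join "`n"

# Stage 1: split just before every line that begins with &Start, then keep
# only the chunks that themselves begin with &Start (drops the preamble).
$blocks = @($text -split '(?m)(?=^&Start)' -match '^&Start')

# Stage 2: inside each chunk keep only the lines that start with '&' and
# store them under the file name that Out-File would have received.
$ndx = 0
$outputs = [ordered]@{}
foreach ($block in $blocks) {
    $ndx++
    $outputs["File$ndx.txt"] = @($block -split '\r?\n' -match '^&')
}

# Print what each output file would contain.
foreach ($name in $outputs.Keys) { "--- $name"; $outputs[$name] }
```

With an array on the left-hand side, `-match` acts as a filter and returns the matching elements, which is why it can be chained directly after `-split` in both stages.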


    To do it for every file in a folder, with output filenames using the fixed naming scheme File<n>.txt across all input files (and thus cumulative numbering):

    Get-ChildItem -File . | ForEach-Object -Begin { $ndx=0 } -Process {
      # '\r?\n' also handles CRLF line endings, which Get-Content -Raw preserves.
      (Get-Content -Raw $_.FullName) -split '(?m)(?=^&Start)' -match '^&Start' | 
        ForEach-Object { $_ -split '\r?\n' -match '^&' | Out-File "File$((++$ndx)).txt" }
    }
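
One thing to watch with the folder-wide command: it writes File<n>.txt into the same folder it reads from, so a second run would treat the earlier output files as input. The sketch below is one possible way around that, assuming separate input and output folders. The function name `Split-StartBlock` and its `-InputDir` / `-OutputDir` parameters are hypothetical, and the demo files are made up for illustration.

```powershell
# Hypothetical helper: reads every file in $InputDir and writes each
# &Start block to File<n>.txt in $OutputDir, numbering cumulatively.
function Split-StartBlock {
    param(
        [Parameter(Mandatory)] [string] $InputDir,
        [Parameter(Mandatory)] [string] $OutputDir
    )
    $ndx = 0
    # @(...) collects the input files up front, before anything is written.
    foreach ($file in @(Get-ChildItem -File -LiteralPath $InputDir)) {
        $raw = Get-Content -Raw -LiteralPath $file.FullName
        foreach ($block in @($raw -split '(?m)(?=^&Start)' -match '^&Start')) {
            $ndx++
            # Keep only the lines that start with '&' within this block.
            $block -split '\r?\n' -match '^&' |
                Out-File -LiteralPath (Join-Path $OutputDir "File$ndx.txt")
        }
    }
    $ndx  # emit the number of files written
}

# Demo in a throwaway temp folder with made-up input files.
$root   = Join-Path ([System.IO.Path]::GetTempPath()) ([guid]::NewGuid().ToString())
$inDir  = Join-Path $root 'in'
$outDir = Join-Path $root 'out'
New-Item -ItemType Directory -Path $inDir, $outDir | Out-Null
Set-Content -LiteralPath (Join-Path $inDir 'a.log') -Value 'junk', '&Start/one', '&Stop/two', '', '&Start/three', '&Stop/four', 'tail'
Set-Content -LiteralPath (Join-Path $inDir 'b.log') -Value '&Start/five', 'noise', '&Stop/six'

$written = Split-StartBlock -InputDir $inDir -OutputDir $outDir
Get-ChildItem -LiteralPath $outDir | Select-Object -ExpandProperty Name
```

The splitting logic is the same as in the command above; only the folder handling differs.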
    

    To do it for every file in a folder, with output filenames based on the input filenames and numbering per input file (PSv4+, due to use of -PipelineVariable):

    Get-ChildItem -File . -PipelineVariable File | ForEach-Object {
      # $File is the current input file; '\r?\n' also handles CRLF line endings.
      (Get-Content -Raw $File.FullName) -split '(?m)(?=^&Start)' -match '^&Start' | 
        ForEach-Object {$ndx=0} { $_ -split '\r?\n' -match '^&' | Out-File "$($File.Name)$((++$ndx)).txt" }
    }
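
Note that the last command appends the counter to the full input name, extension included. The short sketch below shows the name this produces for a hypothetical input file called report.txt, next to one possible alternative that inserts the counter before the extension; the underscore separator is just an illustrative choice.

```powershell
# Hypothetical input file; a FileInfo object can be created without the
# file existing, which is enough to inspect the name properties.
$File = [System.IO.FileInfo] 'report.txt'

# Naming used in the command above: counter and .txt appended to the full name.
$ndx = 0
$asWritten = "$($File.Name)$((++$ndx)).txt"
$asWritten      # report.txt1.txt

# Alternative (an assumption, not part of the original answer): put the
# counter between the base name and the original extension.
$ndx = 0
$alternative = "$($File.BaseName)_$((++$ndx))$($File.Extension)"
$alternative    # report_1.txt
```

If the alternative scheme is used, the outputs are best written to a different folder, since names like report_1.txt would otherwise sit alongside the inputs and be picked up on the next run.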