
How can I modify this PowerShell script?


I have a single text file that contains 60K+ lines. Those lines are actually around 50 or so programs written in Natural, and I need to break them apart into individual programs. I have a script that works perfectly except for a single flaw: the naming of the output files.

Every program starts with "Module Name=", followed by the actual name of the program. I need to split the programs and save them using the actual program names.

Using the example below, I would like to create two files called Program1.txt and Program2.txt, each containing only its own lines. My script, also below, separates the programs correctly, but I cannot work out how to capture the program name and use it as the name of the output file.

Example:

Module Name=Program1
....
....
....
END

Module Name=Program2
....
....
....
END

Code:

$InputFile = "C:\Natural.txt"
$Reader = New-Object System.IO.StreamReader($InputFile)
$a = 1
While (($Line = $Reader.ReadLine()) -ne $null) {
    If ($Line -match "Module Name=") {
        $OutputFile = "MySplittedFileNumber$a.txt"
        $a++
    }    
    Add-Content $OutputFile $Line
}

Solution

  • Combine a switch statement with a System.IO.StreamWriter instance: switch reads a file line by line efficiently with -File and matches each line against one or more regexes with -Regex, while the StreamWriter writes the output files efficiently:

    $outStream = $null
    
    switch -Regex -File C:\Natural.txt {
      '\bModule Name=(\w+)' {   # a module start line
        if ($outStream) { $outStream.Close() }
        $programName = $Matches[1] # Extract the program name.
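        # Create a new output file.
        # Important: use a *full* path, because .NET resolves relative paths
        # against the process's working directory, which usually differs
        # from PowerShell's current location.
        $outStream = [System.IO.StreamWriter] "C:\$programName.txt"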
        # Write the line at hand.
        $outStream.WriteLine($_)
      }
      default {                 # all other lines
        # Write the line at hand to the current output file.
        $outStream.WriteLine($_)    
      }
    }
    if ($outStream) { $outStream.Close() }
    
    

    Note:

    • The code assumes that the very first line in the input file is a Module Name=... line. Otherwise $outStream is still $null when the default branch runs, and the WriteLine() call fails.

    • Regex matching is case-insensitive by default, as PowerShell generally is; add -CaseSensitive to the switch statement if needed.

    • The automatic $Matches variable is used to extract the program name from the matching result.
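
As a sketch of how the first assumption could be relaxed, the variant below wraps the same switch / StreamWriter approach in a function that skips any lines before the first Module Name= line and closes the writer even if an error occurs. The function name Split-ModuleFile and its -Path and -OutputDirectory parameters are hypothetical, chosen only for illustration, and the output directory is assumed to exist already.

```powershell
# Hypothetical helper (not part of the original answer): splits $Path into
# one <ModuleName>.txt file per module inside an existing $OutputDirectory.
function Split-ModuleFile {
  param(
    [Parameter(Mandatory)] [string] $Path,
    [Parameter(Mandatory)] [string] $OutputDirectory
  )

  # Resolve to full paths up front, because the .NET StreamWriter does not
  # honor PowerShell's current location for relative paths.
  $fullPath   = Convert-Path -LiteralPath $Path -ErrorAction Stop
  $fullOutDir = Convert-Path -LiteralPath $OutputDirectory -ErrorAction Stop

  $outStream = $null
  try {
    switch -Regex -File $fullPath {
      '\bModule Name=(\w+)' {   # a module start line
        if ($outStream) { $outStream.Close() }
        # $Matches[1] holds the text captured by (\w+), i.e. the module name.
        $outFile   = Join-Path $fullOutDir "$($Matches[1]).txt"
        $outStream = [System.IO.StreamWriter] $outFile
        $outStream.WriteLine($_)
      }
      default {                 # all other lines
        # Lines that precede the first module start line are skipped.
        if ($outStream) { $outStream.WriteLine($_) }
      }
    }
  }
  finally {
    # Close the last output file, even if an error interrupted the loop.
    if ($outStream) { $outStream.Close() }
  }
}
```

It could be called as Split-ModuleFile -Path C:\Natural.txt -OutputDirectory C:\Programs, where C:\Programs is an assumed, pre-existing folder. Note that \w+ only matches letters, digits, and underscores; if module names can contain other characters such as hyphens, the capture group would need widening, for example to ([\w-]+).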