powershell text-parsing

Find lines between a pattern, and append 1st line to lines


I have the following case that I'm trying to script in PowerShell. I have done this exercise using sed in a Bash terminal, but I'm having trouble writing it in PowerShell. Any help would be greatly appreciated.
(sed -r -e '/^N/h;/^[N-]/d;G;s/(.*)\n(.*)/\2 \1/' <file>, for a file format without the < and > characters surrounding the first letter on each line.)

The start pattern always starts with <N> (only 1 instance per block), the lines in between start with <J>, and the end pattern is always --

--------------
<N>ABC123
<J>SomethingHere1
<J>SomethingHere2
<J>SomethingHere3
--------------    <-- end of section

I'm trying to take the first line in each section (the <N> line) and append it to each <J> line in the same section. For example:

<J>SomethingHere1    <N>ABC123
<J>SomethingHere2    <N>ABC123
<J>SomethingHere3    <N>ABC123

The number of <J> lines per section can vary (0-N). In a case with no <J>, nothing needs to be done.

PowerShell version: 5.1.16299.611


Solution

  • The following pipeline-based solution isn't fast, but it is conceptually straightforward:

    Get-Content file.txt | ForEach-Object {
      if ($_ -match '^-+$') { $newSect = $true }
      elseif ($newSect) { $firstSectionLine = $_; $newSect = $False }
      else { "{0}`t{1}" -f $_, $firstSectionLine }
    }
    
    • It reads and processes lines one by one (with the line at hand reflected in automatic variable $_).

    • It uses a regex (^-+$) with the -match operator to identify section dividers; if one is found, flag $newSect is set to signal that the next line is the section's first data line.

    • If the first data line is hit, it is cached in variable $firstSectionLine, and the $newSect flag is reset.

    • All other lines are by definition the lines to which the first data line is to be appended, which is done via the -f string-formatting operator, using a tab char. (`t) as the separator.
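
    To try the logic without a file, here is a minimal sketch that pipes an in-memory array through the same ForEach-Object body. The $sampleLines array, the second (XYZ789) section, and the $result variable are made up for illustration; the second section has no <J> lines, so it should contribute no output.

```powershell
# Hypothetical sample input: two sections delimited by dashed lines.
# The second section has no <J> lines.
$sampleLines = @(
  '--------------'
  '<N>ABC123'
  '<J>SomethingHere1'
  '<J>SomethingHere2'
  '--------------'
  '<N>XYZ789'
  '--------------'
)

$result = $sampleLines | ForEach-Object {
  if ($_ -match '^-+$') { $newSect = $true }                        # divider: a new section follows
  elseif ($newSect) { $firstSectionLine = $_; $newSect = $false }   # cache the section's first line
  else { "{0}`t{1}" -f $_, $firstSectionLine }                      # append the cached line, tab-separated
}

$result   # two lines, each a <J> line followed by a tab and <N>ABC123
```

    Replacing $sampleLines with Get-Content file.txt gives the file-based form shown above.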


    Here's a faster PSv4+ solution; it is more complex, however, and it reads the entire input file into memory up front:

    ((Get-Content -Raw file.txt) -split '(?m)^-+(?:\r?\n)?' -ne '').ForEach({
      $firstLine, $otherLines = $_ -split '\r?\n' -ne ''
      foreach ($otherLine in $otherLines) { "{0}`t{1}" -f $otherLine, $firstLine }
    })
    
    • Get-Content -Raw reads in the input file in full, as a single string.

    • It uses the -split operator to split the input file into sections, and then processes each section.

    • Regex '(?m)^-+(?:\r?\n)?' matches a section divider line, optionally followed by a newline.

      • (?m) is the multiline option, which makes ^ and $ match the start and end of each line, respectively.
      • \r?\n matches a newline, either in CRLF (\r\n) or LF-only (\n) form.
      • (?:...) is a non-capturing group; making it non-capturing prevents what it matches from being included in the elements returned by -split.
      • -ne '' filters out resulting empty elements.
    • -split '\r?\n' splits each section into individual lines.
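
    The two splitting steps described above can be inspected separately. The following is a minimal sketch, assuming a made-up $sampleText string with LF-only newlines in place of the file content; the variable names $sections and $pairs are hypothetical.

```powershell
# Hypothetical stand-in for (Get-Content -Raw file.txt), using LF-only newlines.
$sampleText = "-----`n<N>ABC123`n<J>One`n<J>Two`n-----`n<N>XYZ789`n-----`n"

# Step 1: split into sections at the divider lines and drop empty elements.
$sections = $sampleText -split '(?m)^-+(?:\r?\n)?' -ne ''

# Step 2: split each section into lines. The destructuring assignment puts
# the first line into $firstLine and any remaining lines into $otherLines.
$pairs = foreach ($section in $sections) {
  $firstLine, $otherLines = $section -split '\r?\n' -ne ''
  foreach ($otherLine in $otherLines) { "{0}`t{1}" -f $otherLine, $firstLine }
}

$sections.Count   # 2 (one element per section)
$pairs            # the two <J> lines, each followed by a tab and <N>ABC123
```

    The XYZ789 section has no <J> lines, so $otherLines is $null there and the inner loop emits nothing for it.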

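    If performance is still a concern, you could speed up reading the file by replacing Get-Content -Raw file.txt with [IO.File]::ReadAllText("$PWD/file.txt"). A full path is needed because .NET's working directory usually differs from PowerShell's current location.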
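
    As a rough sketch of that variant, the snippet below writes a small demo file to the temp directory, reads it with [IO.File]::ReadAllText(), and runs the same splitting logic. The file name sections-demo.txt and the variable names are made up for illustration, and Convert-Path is used here as one way to obtain a full file-system path; it is not part of the original answer.

```powershell
# Hypothetical demo file in the temp directory, so the sketch runs anywhere.
$demoFile = Join-Path ([IO.Path]::GetTempPath()) 'sections-demo.txt'
Set-Content -LiteralPath $demoFile -Value '-----', '<N>ABC123', '<J>One', '-----'

# Resolve to a full file-system path before handing it to a .NET method.
$fullPath = Convert-Path -LiteralPath $demoFile
$text = [IO.File]::ReadAllText($fullPath)

# Same section/line splitting as in the -split based solution.
$lines = ($text -split '(?m)^-+(?:\r?\n)?' -ne '').ForEach({
  $firstLine, $otherLines = $_ -split '\r?\n' -ne ''
  foreach ($otherLine in $otherLines) { "{0}`t{1}" -f $otherLine, $firstLine }
})

Remove-Item -LiteralPath $demoFile   # clean up the demo file
$lines
```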