Tags: regex, powershell, visual-studio-code, multiline

Multiline regex match in PowerShell with or without lookahead


I'm trying to format a markdown file so that there is one empty line after each heading. The file is UTF-8 encoded with CRLF line breaks. Here is an example file:

## DESCRIPTION
description entry...

## EXAMPLES

### EXAMPLE 1
```
some example here...
```

## OUTPUTS
## NOTES

Here I want to find all headings that are not followed by an empty line. Assuming the file name is file.md, here is sample code whose only purpose is to match headings that lack a trailing empty line:

$FileData = Get-Content file.md

if ($FileData -match '(?m)^#+\s.*$\s*^.+') { $Matches }

Expected output:

## DESCRIPTION
### EXAMPLE 1
## OUTPUTS

Actual output:

<no output>

I also tried the following regexes, but none of them works:

(?m)^#+\s.*\n*^.+
(?m)^#+\s.*\r\n*^.+
^#+\s.*$(?=\n^.+)
^#+\s.*$(?=\r\n^.+)
^#+\s.*$(?=\s^.+)

Nothing is matched. These regexes should work, because with a small modification they work just fine in VSCode, but not in PowerShell. For example:

^#+\s.*$(?=\n^.+) works just fine in the VSCode engine. \n is used for VSCode, but in PowerShell (?m), \r\n, or \n should be used, and none of these constructs works.

I'm sure somebody has an answer to this, but please explain in your answer why both (?m) and \r\n don't work, and how to make use of both of them in this specific scenario.

EDIT:

Following a comment by Wiktor, I tried his suggestion, but it doesn't give me the result I want:

$FileData = Get-Content file.md -Raw

foreach ($Line in $FileData) {
    if ($Line -match '^#+\s.*$(?=\s^.+)') { $Matches }
}

I tried all the sample regexes posted here, but for each of them the output is either wrong or empty.


Solution

  • You need to make sure you pass the whole file to the regex as a single string, using the -Raw option.

    Then you need to make sure the pattern works in multiline mode. You can use

    (?m)^#+[\p{Zs}\t].*$(?=\n.)
    

    See the regex demo.

    • (?m) - multiline mode: ^ now matches the start of a line and $ matches the end of a line
    • ^ - start of a line
    • #+ - one or more # chars
    • [\p{Zs}\t] - any horizontal whitespace
    • .* - any zero or more chars other than a newline/line feed
    • $ - end of line (position before a newline char)
    • (?=\n.) - a positive lookahead that makes sure there is a newline and any char other than a newline immediately to the right of the current location.
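A minimal sketch for checking the breakdown above without touching the file system, assuming a sample built in memory with LF line endings (the variable names are illustrative and not from the original post). It also shows why feeding the text line by line, as Get-Content does without -Raw, finds nothing: no single line contains a newline for the lookahead to see.

```powershell
# Sample document from the question, assembled with explicit LF line endings.
# Building the string by hand avoids depending on the script file's own line endings.
$fence = '`' * 3   # three literal backticks (a markdown code fence)
$lines = @(
    '## DESCRIPTION'
    'description entry...'
    ''
    '## EXAMPLES'
    ''
    '### EXAMPLE 1'
    $fence
    'some example here...'
    $fence
    ''
    '## OUTPUTS'
    '## NOTES'
)
$text = $lines -join "`n"

$pattern = '(?m)^#+[\p{Zs}\t].*$(?=\n.)'

# Line-by-line input (the shape Get-Content returns without -Raw):
# each element is one line with no newline in it, so (?=\n.) can never succeed.
$perLine = @($lines | Where-Object { $_ -match $pattern })

# Whole text as a single string (the shape Get-Content -Raw returns).
$headings = @([regex]::Matches($text, $pattern) | ForEach-Object { $_.Value })

"Per-line matches: $($perLine.Count)"
$headings
```

On this LF-only sample, the per-line count is 0 and the whole-string search returns the three headings listed as the expected output in the question.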

    In PowerShell, to run it against a file, you can use

    Get-Content 'c:\1\1.txt' -Raw | Select-String '(?m)^#+[\p{Zs}\t].*$(?=\n.)' -AllMatches | ForEach-Object { $_.Matches } | ForEach-Object { $_.Value }
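One caveat for the CRLF file described in the question: in .NET, . also matches \r, and $ in multiline mode only matches before \n. On CRLF input the pattern above therefore keeps a trailing \r in each match, and the \r of an empty line can satisfy the . inside the lookahead. The sketch below compares it with a CRLF-tolerant variant; that variant is a suggestion of mine rather than part of the original answer, and the sample text and variable names are illustrative.

```powershell
# Shortened sample, joined with CRLF to mimic the file described in the question.
$lines = @(
    '## DESCRIPTION'
    'description entry...'
    ''
    '## EXAMPLES'
    ''
    '## OUTPUTS'
    '## NOTES'
)
$crlfText = $lines -join "`r`n"

# Pattern from the answer above.
$original = '(?m)^#+[\p{Zs}\t].*$(?=\n.)'

# Hypothetical CRLF-tolerant variant: exclude \r and \n explicitly instead of
# relying on '.' and '$', and accept either \r\n or \n as the line break.
$crlfAware = '(?m)^#+[\p{Zs}\t][^\r\n]*(?=\r?\n[^\r\n])'

$originalHits = @([regex]::Matches($crlfText, $original) | ForEach-Object { $_.Value })
$crlfHits     = @([regex]::Matches($crlfText, $crlfAware) | ForEach-Object { $_.Value })

# Trim the captured \r for display only.
"Original pattern:  $(($originalHits | ForEach-Object { $_.TrimEnd() }) -join ', ')"
"CRLF-aware variant: $($crlfHits -join ', ')"
```

With this sample, the original pattern also reports ## EXAMPLES even though an empty line follows it, while the variant reports only ## DESCRIPTION and ## OUTPUTS. Since \r is optional in the lookahead, the variant should behave the same on LF-only files.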