Tags: php, regex, pcre, regex-negation, heredoc

Regex to match (1 or more) PHP heredocs containing an empty line


Example text at: https://regex101.com/r/tfYEkO/1

I want to find heredocs in PHP code that contain an empty line.

I can do that using this regex, but if there are 2 heredocs in a file, it'd match from the start of the first to the end of the second:

<<<([A-Z]+)\n.*\n\n.*\n *\1\b

So I thought negative lookaheads would solve it, but that doesn't match anything:

<<<([A-Z]+)\n(?!.*\1.*).*\n\n(?!.*\1.*).*\n *\1\b

I don't think I can use negative lookbehinds with the .* in it. I tried the ungreedy flag, but that didn't seem to change it.
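
Both observations can be reproduced with a short PHP sketch. It assumes the s (DOTALL) modifier was enabled, which the question does not state but which the overmatching suggests, since .* would otherwise stop at each line break. The sample text and variable names are hypothetical.

```php
<?php
// Hypothetical sample: two heredocs that share the label HTML.
$code = "\$a = <<<HTML\none\n\ntwo\nHTML;\n"
      . "\$b = <<<HTML\nthree\n\nfour\nHTML;\n";

// First attempt. With s, each greedy .* can run across line breaks,
// so the match ends at the last closing marker in the text.
$greedy = '~<<<([A-Z]+)\n.*\n\n.*\n *\1\b~s';
$greedyCount = preg_match_all($greedy, $code, $m);
$spansBoth = $greedyCount === 1 && strpos($m[0][0], 'three') !== false;

// Second attempt. (?!.*\1.*) asks that the label does not occur anywhere
// later in the text, but the closing marker is always there, so it fails.
$lookahead = '~<<<([A-Z]+)\n(?!.*\1.*).*\n\n(?!.*\1.*).*\n *\1\b~s';
$lookaheadCount = preg_match_all($lookahead, $code);

var_dump($spansBoth, $lookaheadCount);
```

Under that assumption, the first pattern returns a single match that swallows both heredocs, and the second returns no match at all.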

FYI, a heredoc in PHP starts with <<< and a keyword, and ends with that keyword on its own line:

$foo = <<<HTML
This is the string that is returned.

It can contain multiple lines.
HTML;

Solution

  • You may use

    '~<<<([A-Za-z_]\w*)(?:\R(?!\1;\R).*)*\R(?:\R(?!\1;\R).*)*\R\1;\R~'
    

    See the regex demo
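
As a minimal sketch of how the pattern above might be used from PHP, the snippet below runs it over a hypothetical string holding three heredocs, of which the middle one has no empty line. The sample text and variable names are illustrative assumptions, not part of the original answer.

```php
<?php
// Hypothetical input: three heredocs, the middle one has no empty line.
$code = "\$a = <<<HTML\nfirst\n\nsecond\nHTML;\n"
      . "\$b = <<<TEXT\nno empty line here\nTEXT;\n"
      . "\$c = <<<SQL\nSELECT 1\n\nFROM t\nSQL;\n";

$pattern = '~<<<([A-Za-z_]\w*)(?:\R(?!\1;\R).*)*\R(?:\R(?!\1;\R).*)*\R\1;\R~';

// Each match is one complete heredoc; $matches[1] holds the labels.
$count = preg_match_all($pattern, $code, $matches);

echo $count, ': ', implode(', ', $matches[1]), PHP_EOL;
```

With this sample the two heredocs labelled HTML and SQL are reported as separate matches and TEXT is skipped, because the lookahead stops each repeated group at the heredoc's own closing marker.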

    To comply with the more relaxed heredoc syntax of PHP 7.3 (the closing marker may be indented, and it no longer has to be followed by a line break), use

    '~<<<([A-Za-z_]\w*)(?:\R(?!\h*\1;$).*)*\R(?:\R(?!\h*\1;$).*)*\R\h*\1;$~m'
    

    See another regex demo.
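
The difference between the two patterns can be checked with a small sketch like the one below. The input is a hypothetical PHP 7.3-style heredoc whose closing marker is indented and is not followed by a line break; the names $strict and $relaxed are only labels chosen for this example.

```php
<?php
// Hypothetical input: indented closing marker, no trailing line break.
$code = "\$a = <<<HTML\n    first\n\n    second\n    HTML;";

$strict  = '~<<<([A-Za-z_]\w*)(?:\R(?!\1;\R).*)*\R(?:\R(?!\1;\R).*)*\R\1;\R~';
$relaxed = '~<<<([A-Za-z_]\w*)(?:\R(?!\h*\1;$).*)*\R(?:\R(?!\h*\1;$).*)*\R\h*\1;$~m';

// The first pattern needs the marker at the start of a line and a line break after it.
$strictResult = preg_match($strict, $code);

// The second pattern allows leading horizontal whitespace and ends at a line end.
$relaxedResult = preg_match($relaxed, $code, $m);

var_dump($strictResult, $relaxedResult);
```

Here only the second pattern should find the heredoc. Both patterns still expect a ; directly after the closing marker.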

    Details

    • <<< - a literal <<< substring
    • ([A-Za-z_]\w*) - Group 1: a valid PHP label (a letter or underscore, followed by any number of letters, digits, or underscores)
    • (?:\R(?!\1;\R).*)* - 0 or more repetitions of a line break (\R) that is not followed by the Group 1 value, ; and a line break, and then the rest of that line (.*)
    • \R - a line break; the next token matched is also a line break, which is what requires an empty line
    • (?:\R(?!\1;\R).*)* - see above (in the second pattern, (?!\h*\1;$) means "not followed by 0+ horizontal whitespace characters, the Group 1 value and ; at the end of the line")
    • \R - a line break
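    • \1 - same value as in Group 1 (preceded by \h*, 0+ horizontal whitespace characters, in the second pattern)
    • ; - a semicolon
    • \R - a line break / $ - end of a line (with the m modifier, $ matches at the end of every line, not only at the end of the string).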
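
The breakdown above can also be kept inside the pattern itself. The sketch below rewrites the first pattern with the x (extended) modifier, which ignores whitespace in the pattern and treats # as the start of a comment; this layout is an illustration and not part of the original answer, and the sample input is hypothetical.

```php
<?php
// Same pattern as the first one above, laid out with the x modifier.
$pattern = '~
    <<<                       # a literal <<<
    ([A-Za-z_]\w*)            # Group 1: the heredoc label
    (?: \R (?!\1;\R) .* )*    # 0+ lines that are not the closing marker
    \R                        # a line break that is directly followed by
                              #   another line break: the empty line
    (?: \R (?!\1;\R) .* )*    # 0+ further lines that are not the closing marker
    \R                        # the line break before the closing marker
    \1 ;                      # the Group 1 label and a semicolon
    \R                        # the line break after the closing marker
~x';

// Hypothetical input with one empty line inside the heredoc.
$code = "\$a = <<<HTML\nfirst\n\nsecond\nHTML;\n";
$found = preg_match($pattern, $code, $m);

var_dump($found, $m[1]);
```

Because ~ is the delimiter here, the comments inside the pattern must not contain that character, and they must avoid single quotes since the pattern is a single-quoted PHP string.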