php regex regex-group

What regular expression matches and captures a multiline string preceded by an undefined number of newlines? [PCRE]


I have this multi-line string:

Lorem ipsum dolor sit amet.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus
dictum, lorem et fringilla congue, velit libero sagittis eros, id
lobortis nisi risus ac mauris.

I would like to use a Perl Compatible Regular Expression (PCRE) in PHP to capture, in a named group, the second "paragraph" (the 3-line text after the blank line).

I tried the following regular expression on regex101 and it works fine:

/\n(\n)+(?<namedGroup>([\w\d]+.*(\n)?)+)/m

but when I tried it in PHP using the following code, nothing gets captured:

<?php
$text = file_get_contents("paragraphs.txt");

$regular_expression = '/\n(\n)+(?<namedGroup>([\w\d]+.*(\n)?)+)/m';

preg_match($regular_expression, $text, $result);
print_r($result);
?>

Solution

  • The pattern you are currently using can be improved in several ways:

    $regular_expression = '/\n(\n)+(?<namedGroup>([\w\d]+.*(\n)?)+)/m';
    

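    Your pattern only matches the newline character \n, but your file apparently uses \r\n line endings. In that case the blank line is \r\n\r\n, so two consecutive \n never occur and \n(\n)+ cannot match. To handle this, use \R, which matches any Unicode newline sequence and treats \r\n as a single unit.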
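To make the line-ending issue concrete, here is a minimal sketch. The inline string is an assumption standing in for the contents of paragraphs.txt, which are not shown with their actual line endings in the question.

```php
<?php
// Assumed sample with Windows (CRLF) line endings, standing in for paragraphs.txt.
$text = "Lorem ipsum dolor sit amet.\r\n\r\nLorem ipsum dolor sit amet, consectetur.\r\nSecond line.\r\n";

// Start of the original pattern: needs two consecutive \n, which never occur in CRLF text.
$withLf = preg_match('/\n(\n)+/', $text);

// \R consumes \r\n as one newline sequence, so the blank line is found.
$withR = preg_match('/\R{2,}/', $text);

var_dump($withLf); // int(0)
var_dump($withR);  // int(1)
```

If the file really does use \r\n, this would explain why the pattern works on regex101 (where pasted text typically ends up with \n only) but not on the file read from disk.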

    If you only need a single value, you can omit the (?<namedGroup>...) group altogether and use \K, which discards everything matched so far from the reported match. The paragraph is then the full match (index 0 of the result array).

    Note that:

    • [\w\d] is the same as \w, as \w already matches digits
    • Your pattern has a total of 4 capture groups, although the named capture group alone would suffice
    • You don't need the m (multiline) flag, as there are no anchors (^ or $) in the pattern
    • The pattern matches only lines that start with a word character \w
    • Not relevant for the match of interest, but repeating a capture group like this (\n)+ only captures the value of the last iteration
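Two of the notes above can be checked quickly in PHP. This is only an illustrative sketch; the sample strings '123' and 'abc_123' are made up for the demonstration.

```php
<?php
// A repeated capture group keeps only the value of its last iteration.
preg_match('/(\d)+/', '123', $m);
var_dump($m[0]); // string(3) "123"
var_dump($m[1]); // string(1) "3"

// [\w\d] and \w accept the same characters, digits included.
$withClass = preg_match('/^[\w\d]+$/', 'abc_123');
$withWord  = preg_match('/^\w+$/', 'abc_123');
var_dump($withClass); // int(1)
var_dump($withWord);  // int(1)
```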

    The updated pattern that you could use for a single match:

    \R{2,}\K\w.*(?:\R\w.*)*
    
    • \R{2,} Match 2 or more Unicode newline sequences
    • \K Forget what is matched so far
    • \w.* Match a word character and the rest of the line
    • (?:\R\w.*)* Optionally repeat a Unicode newline sequence, a word character and the rest of the line
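As a rough sketch of how this pattern could be used from PHP: the inline $text below is an assumption (the question's sample with \r\n line endings) in place of file_get_contents("paragraphs.txt"), and the str_replace cleanup is an optional extra that is not part of the pattern itself.

```php
<?php
// Assumed sample with CRLF line endings, standing in for paragraphs.txt.
$text = "Lorem ipsum dolor sit amet.\r\n"
      . "\r\n"
      . "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus\r\n"
      . "dictum, lorem et fringilla congue, velit libero sagittis eros, id\r\n"
      . "lobortis nisi risus ac mauris.\r\n";

// \K drops the leading newlines from the reported match, so $match[0] is the paragraph.
if (preg_match('/\R{2,}\K\w.*(?:\R\w.*)*/', $text, $match)) {
    // With PCRE's usual LF newline convention "." can also match \r,
    // so strip any carriage returns that ended up in the match.
    $paragraph = str_replace("\r", '', $match[0]);
    echo $paragraph, PHP_EOL;
}

// Variant that keeps the group name from the question instead of using \K.
preg_match('/\R{2,}(?<namedGroup>\w.*(?:\R\w.*)*)/', $text, $named);
$namedParagraph = str_replace("\r", '', $named['namedGroup']);
```

Both variants should yield the same three-line paragraph; the \K form simply returns it as the whole match rather than as a group.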

    Or, to match lines that start with any non-whitespace character \S instead of a word character:

    \R{2,}\K\S.*(?:\R\S.*)*
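The difference between the \w and \S variants only shows up when a line starts with a non-word character. A small sketch, using a made-up sample in which the second paragraph opens with a quote and contains a line starting with a dash:

```php
<?php
// Hypothetical sample: the paragraph after the blank line starts with a quote character.
$text = "Intro.\n\n\"Quoted\" opening line\n- dash line\nlast line\n";

// \w variant: the first character after the blank line is a quote, so nothing matches.
$foundW = preg_match('/\R{2,}\K\w.*(?:\R\w.*)*/', $text, $w);

// \S variant: any non-whitespace character may start a line, so the whole paragraph matches.
$foundS = preg_match('/\R{2,}\K\S.*(?:\R\S.*)*/', $text, $s);

var_dump($foundW); // int(0)
var_dump($foundS); // int(1)
echo $s[0], PHP_EOL;
```

Which variant fits better depends on whether lines beginning with punctuation should count as part of the paragraph.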
    
