php, regex

How to build a regex pattern from a list of words (in a string or array) and surround each word with word boundary anchors?


I have a string of text and a series of words:

String: Auch ein blindes Huhn findet einmal ein Korn.
Words: auch, ein, sehendes

I want to check which of the words is contained in the string. I am using preg_match_all for that:

$pattern = "/\bauch\b|\bein\b|\bsehendes\b/i";
$subject = "Auch ein blindes Huhn findet einmal ein Korn.";

preg_match_all($pattern, $subject, $matches);
print_r($matches);

Array
(
    [0] => Array
        (
            [0] => Auch
            [1] => ein
            [2] => ein
        )

)

This works as expected. However, I have to edit the pattern frequently, and finding and replacing words is confusing when each one is surrounded by word boundary anchors (\b). I would like to define the list of words without the anchors and add them in a second step, something like:

$pattern = "auch|ein|sehendes";
$pattern = "/\b" . $pattern . "\b/i";

That, of course, doesn't work as expected.
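
To make the failure concrete: alternation has the lowest precedence in a regex, so the two anchors attach only to the first and last alternatives, and the middle word is matched without any boundary check. A minimal sketch, reusing the subject from above (the names $naive and $count are only for illustration):

```php
<?php
$subject = "Auch ein blindes Huhn findet einmal ein Korn.";

// Read by the regex engine as: \bauch  OR  ein  OR  sehendes\b
// The middle alternative carries no anchors at all.
$naive = "/\b" . "auch|ein|sehendes" . "\b/i";

$count = preg_match_all($naive, $subject, $matches);

echo $count, "\n";    // 4 instead of the expected 3
print_r($matches[0]); // the extra match is the "ein" inside "einmal"
```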

I could fill an array with the words and loop over it to build the pattern, but I'd like to avoid loops. Any ideas how to do this fast? The real string and the real number of words are quite large.

Ultimately I need the number of matches, as returned by preg_match_all. For the example, the expected output is 3.
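
For reference, the count can be read straight from the return value. A small sketch using the working pattern from above; it assumes PHP 5.4 or later, where the $matches argument is optional, and $count is just an illustrative name:

```php
<?php
$pattern = "/\bauch\b|\bein\b|\bsehendes\b/i";
$subject = "Auch ein blindes Huhn findet einmal ein Korn.";

// preg_match_all returns the number of full pattern matches
// (or false on failure), so there is no need to count($matches[0]).
$count = preg_match_all($pattern, $subject);

echo $count, "\n"; // 3
```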


Here is a similar question where this is done in JavaScript: Apply a word-boundary anchor to all tokens in a single regex


Solution

  • Wrap the alternation of keywords in a group, so that the word boundary anchors apply to every alternative, e.g.

    $pattern = "/\b(?:auch|ein|sehendes)\b/i";
    $subject = "Auch ein blindes Huhn findet einmal ein Korn.";
    preg_match_all($pattern, $subject, $matches);
    print_r($matches);
    
    Array
    (
        [0] => Array
            (
                [0] => Auch
                [1] => ein
                [2] => ein
            )
    )
    

    Note: The ?: at the start of the group (?:...) makes it non-capturing. A capturing group would also work, but preg_match_all would then fill $matches[1] with a copy of each matched word, which is not needed here.
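
If the words are kept in a string or an array, the same pattern can be assembled without writing an explicit loop. The following is only a sketch: buildWordPattern is a hypothetical helper name, and the ", " separator of the word string is an assumption taken from the example in the question. preg_quote escapes any regex metacharacters in the words, plus the "/" delimiter.

```php
<?php
// Hypothetical helper: builds a whole-word, case-insensitive pattern
// from a plain list of words.
function buildWordPattern(array $words): string
{
    // Escape regex metacharacters and the "/" delimiter in each word.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $words);

    // One pair of anchors around a non-capturing alternation.
    return '/\b(?:' . implode('|', $quoted) . ')\b/i';
}

// Assumption: the word list is a string separated by ", ".
$words   = explode(', ', 'auch, ein, sehendes');
$pattern = buildWordPattern($words);
$subject = "Auch ein blindes Huhn findet einmal ein Korn.";

echo $pattern, "\n";                           // /\b(?:auch|ein|sehendes)\b/i
echo preg_match_all($pattern, $subject), "\n"; // 3
```

Editing the word list then only means changing the string or array; the anchors and the group are added in one place.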