php regex diacritics words

Regex to match all words outside tags, including accented words, in PHP


I want to wrap every word in a text in <mark> tags, except the words that are inside an HTML tag. Based on the idea from here, I came up with the following:

preg_replace("/(\b(\p{L}+)\b)(?!([^<]+)?>)/", "<mark>$1</mark>", $input);

This works fine, EXCEPT for some odd behavior with accented characters. Examples:

lorem ipsúm dolor <a href="#" title="sit">sit</a> amet consectetur
[OK] => <mark>lorem</mark> <mark>ipsúm</mark> <mark>dolor</mark> <a href="#" title="sit"><mark>sit</mark></a> <mark>amet</mark> <mark>consectetur</mark>

ação ipísum
[NOT OK] => <mark>a</mark>çã<mark>o</mark> <mark>ip</mark>í<mark>sum</mark>

Any idea why this is happening and how to fix it? Thanks


Solution

  • It's all so simple...

    A couple things here:

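    1. You want to use the UTF-8 modifier u. Without it, the subject is processed byte by byte, so a multi-byte character such as "ç" is split up and the match breaks off in the middle of the word.
    2. Not relevant for your sample text, but you're leaving out graphemes made of a letter plus a combining diacritic, like "è", which is encoded here as "e" followed by a combining grave accent (U+0300). To match those, follow each \p{L} with an optional run of marks, \p{M}*.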
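
    To make both points concrete, here is a minimal sketch. It assumes PHP 7 or later (for the "\u{...}" escape syntax) and writes the accented characters as explicit code points, so the precomposed and decomposed forms cannot be confused. The variable names are only illustrative.

```php
<?php
// Precomposed "è" is a single code point (U+00E8); the decomposed form is
// "e" followed by U+0300 COMBINING GRAVE ACCENT.
$precomposed = "\u{00E8}";
$decomposed  = "e\u{0300}";

// Without /u the dot matches one byte at a time; with /u, one code point.
$bytes      = preg_match_all('/./s', $precomposed);   // 2 (two UTF-8 bytes)
$codePoints = preg_match_all('/./su', $precomposed);  // 1

// \p{L} alone stops in front of the combining mark...
$letterOnly = preg_match('/^\p{L}+$/u', $decomposed);           // 0
// ...while letting each letter carry optional marks covers the whole string.
$withMarks  = preg_match('/^(?:\p{L}\p{M}*)+$/u', $decomposed); // 1

var_dump($bytes, $codePoints, $letterOnly, $withMarks);
```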

    So the regex grows:

    $input = 'lorem <a href="#">foo</a> ação';
    
    echo preg_replace(
        '/\b((?:\p{L}\p{M}*)+)\b(?!([^<]+)?>)/u',
        "<mark>$1</mark>",
        $input
    );
    

    Outputs:

    <mark>lorem</mark> <a href="#"><mark>foo</mark></a> <mark>ação</mark>
    

    ...except

    So far so good, right? Let's add that "è" and see.

    $input = 'lorem <a href="#">foo</a> ação evè';
    

    Nets the output:

    <mark>lorem</mark> <a href="#"><mark>foo</mark></a> <mark>ação</mark> <mark>eve</mark>̀
    

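    That ain't right: the combining accent ended up outside the <mark> element. Turns out the word boundary shorthand \b still acts a bit silly, even in UTF-8 mode. It reports a boundary between the "e" and the combining mark that follows it, apparently because the mark is not counted as a word character. So you have to replace \b with negative lookarounds that treat both letters and marks as part of a word.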

    While we're at it let's also use \pL in place of \p{L} as the curly braces are optional for single-letter Unicode categories.
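
    Before adding the tag check back in, the lookaround-based boundary can be tried on its own. This is a minimal sketch that assumes PHP 7 or later for the "\u{...}" escapes. It only extracts words and does not yet skip anything inside tags.

```php
<?php
// "evè" with a decomposed accent (e + U+0300), then "ação" with
// precomposed letters (U+00E7, U+00E3).
$text = "eve\u{0300} a\u{00E7}\u{00E3}o";

// A word is a run of letters, each optionally followed by combining marks,
// that is neither preceded nor followed by another letter or mark.
preg_match_all('/(?<![\pL\pM])(?:\pL\pM*)+(?![\pL\pM])/u', $text, $matches);

// $words holds two entries; the first keeps its combining accent attached.
$words = $matches[0];
print_r($words);
```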


    Putting it all together:

    $input = 'lorem <a href="#">foo</a> ação evè';
    
    echo preg_replace(
        '/(?<![\pL\pM])((?:\pL\pM*)+)(?![\pL\pM])(?!([^<]+)?>)/u',
        "<mark>$1</mark>",
        $input
    );
    

    Outputs:

    <mark>lorem</mark> <a href="#"><mark>foo</mark></a> <mark>ação</mark> <mark>evè</mark>
    

    Demo at https://eval.in/194139.
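
    If the replacement is needed in more than one place, the final pattern could be wrapped in a small function. The name mark_words and the fallback to the unmodified input are assumptions made for this sketch, not part of the answer above; it also assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: wraps every word outside of tags in <mark>...</mark>.
function mark_words(string $html): string
{
    $pattern = '/(?<![\pL\pM])'     // not preceded by a letter or mark
             . '((?:\pL\pM*)+)'     // the word: letters with optional marks
             . '(?![\pL\pM])'       // not followed by a letter or mark
             . '(?!([^<]+)?>)/u';   // no ">" ahead before the next "<"

    // preg_replace() returns null on failure (for example on invalid UTF-8);
    // in that case this sketch hands back the input unchanged.
    return preg_replace($pattern, '<mark>$1</mark>', $html) ?? $html;
}

echo mark_words('<a href="#" title="sit">sit</a> amet'), "\n";
echo mark_words("a\u{00E7}\u{00E3}o eve\u{0300}"), "\n";
```

    One limitation follows from the tag check itself: a literal ">" in ordinary text (as in "a > b") looks like the end of a tag, so the words before it are left unmarked.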