Regex to match all words outside tags, including accented words, in PHP

I want to wrap every word in a text in a <mark> tag, except the words that are inside a tag. Based on the idea from here, I came up with the following:

preg_replace("/(\b(\p{L}+)\b)(?!([^<]+)?>)/", "<mark>$1</mark>", $input);

This works fine, except for some odd behavior with accented characters. Examples:

lorem ipsúm dolor <a href="#" title="sit">sit</a> amet consectetur
[OK] => <mark>lorem</mark> <mark>ipsúm</mark> <mark>dolor</mark> <a href="#" title="sit"><mark>sit</mark></a> <mark>amet</mark> <mark>consectetur</mark>

ação ipísum
[NOT OK] => <mark>a</mark>çã<mark>o</mark> <mark>ip</mark>í<mark>sum</mark>

Any idea why this is happening and how to fix it? Thanks

Solution

It's all so simple...

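A couple of things here:

You want to use the UTF-8 modifier u. Without it the subject is matched byte by byte, so a multibyte character such as "ç" is not seen as a single letter and the match breaks off around it.
Not relevant for your sample text, but you're leaving out graphemes made of a letter plus combining diacritics, like "è" when it is encoded as "e" followed by a combining grave accent (U+0300). To match those you need to add an optional \p{M}* after each letter.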
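To make both points concrete, here is a small sketch. It assumes PHP 7 or later for the "\u{...}" string escapes, and the variable names are mine, not from the question.

```php
<?php
// Two encodings of the same visible character.
$precomposed = "\u{00E8}";   // one code point: LATIN SMALL LETTER E WITH GRAVE
$decomposed  = "e\u{0300}";  // "e" followed by COMBINING GRAVE ACCENT

// Without /u the two-byte UTF-8 sequence is not matched as one letter.
var_dump(preg_match('/^\p{L}$/', $precomposed));        // int(0)
// With /u it is a single letter.
var_dump(preg_match('/^\p{L}$/u', $precomposed));       // int(1)

// The decomposed form is a letter plus a mark, so \p{L} alone is not enough.
var_dump(preg_match('/^\p{L}$/u', $decomposed));        // int(0)
var_dump(preg_match('/^\p{L}\p{M}*$/u', $decomposed));  // int(1)
```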

So the regex grows:

$input = 'lorem <a href="#">foo</a> ação';

echo preg_replace(
    '/\b((?:\p{L}\p{M}*)+)\b(?!([^<]+)?>)/u',
    "<mark>$1</mark>",
    $input
);

Outputs:

<mark>lorem</mark> <a href="#"><mark>foo</mark></a> <mark>ação</mark>

...except

So far so good, right? Let's add that "è" and see.

$input = 'lorem <a href="#">foo</a> ação evè';

Nets the output:

<mark>lorem</mark> <a href="#"><mark>foo</mark></a> <mark>ação</mark> <mark>eve</mark>̀

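That ain't right: the combining accent ended up outside the <mark>. Turns out the word boundary shorthand \b still acts a bit silly, even in UTF-8 mode. Judging by the output, it sees a boundary between the final "e" and its combining mark, so the match stops one character early. So you have to replace \b with negative lookarounds that treat both letters and marks as part of a word.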
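To see the lookaround idea in isolation, here is a minimal sketch that only extracts words from plain text, with no tag handling yet. It assumes PHP 7 or later for the "\u{...}" escape, and $text, $pattern and $matches are names picked for illustration.

```php
<?php
// "evè" with a decomposed final letter: "e" + COMBINING GRAVE ACCENT (U+0300).
$text = "ação ev" . "e\u{0300}";

// A word is a run of letters, each optionally followed by combining marks,
// that is neither preceded nor followed by another letter or mark.
$pattern = '/(?<![\p{L}\p{M}])(?:\p{L}\p{M}*)+(?![\p{L}\p{M}])/u';

preg_match_all($pattern, $text, $matches);

// $matches[0] holds the two words, the second one including its combining accent.
print_r($matches[0]);
```

The lookbehind and lookahead do the job \b was supposed to do, but with an explicit definition of what counts as part of a word.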

While we're at it, let's also use \pL in place of \p{L}, as the curly braces are optional for single-letter Unicode categories.

Putting it all together:

$input = 'lorem <a href="#">foo</a> ação evè';

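// (?<![\pL\pM])   not preceded by a letter or combining mark
// ((?:\pL\pM*)+)  a run of letters, each with optional combining marks
// (?![\pL\pM])    not followed by a letter or combining mark
// (?!([^<]+)?>)   no ">" ahead before the next "<", i.e. not inside a tag
echo preg_replace(
    '/(?<![\pL\pM])((?:\pL\pM*)+)(?![\pL\pM])(?!([^<]+)?>)/u',
    "<mark>$1</mark>",
    $input
);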

Outputs:

<mark>lorem</mark> <a href="#"><mark>foo</mark></a> <mark>ação</mark> <mark>evè</mark>

Demo at https://eval.in/194139.
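If you need this in more than one place, it could be wrapped in a helper along these lines. This is only a sketch: the function name markWords is hypothetical, it assumes PHP 7 or later, and the tag lookahead is written as the non-capturing (?![^<]*>), which should behave the same as the version above.

```php
<?php
/**
 * Wraps every word outside of tags in <mark>...</mark>.
 * Hypothetical helper; the name and signature are illustrative only.
 */
function markWords(string $html): string
{
    $pattern = '/(?<![\pL\pM])((?:\pL\pM*)+)(?![\pL\pM])(?![^<]*>)/u';

    // preg_replace returns null on error (e.g. invalid UTF-8); fall back to the input.
    return preg_replace($pattern, '<mark>$1</mark>', $html) ?? $html;
}

echo markWords('lorem <a href="#" title="sit">sit</a> amet');
// <mark>lorem</mark> <a href="#" title="sit"><mark>sit</mark></a> <mark>amet</mark>
```

Keep in mind that the tag check is only a heuristic: a word followed by a literal ">" in the text (with no "<" in between) will be treated as if it were inside a tag and left unmarked.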