I want to mark all words inside a text, except the ones that are inside a tag. Based on the idea from here, I was able to accomplish the following:
preg_replace("/(\b(\p{L}+)\b)(?!([^<]+)?>)/", "<mark>$1</mark>", $input);
This works fine, except for some weird behavior with accented characters. Examples:
lorem ipsúm dolor <a href="#" title="sit">sit</a> amet consectetur
[OK] => <mark>lorem</mark> <mark>ipsúm</mark> <mark>dolor</mark> <a href="#" title="sit"><mark>sit</mark></a> <mark>amet</mark> <mark>consectetur</mark>
ação ipísum
[NOT OK] => <mark>a</mark>çã<mark>o</mark> <mark>ip</mark>í<mark>sum</mark>
Any idea why this is happening and how to fix it? Thanks
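A couple of things here:
You need the u modifier so that the pattern and the subject are treated as UTF-8 rather than as single bytes.
An accented character can be written as a base letter followed by a combining mark (for example "è" as "e" plus a combining grave accent), so the pattern also has to allow for \p{M} after each letter.
So the regex grows: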
$input = 'lorem <a href="#">foo</a> ação';
echo preg_replace(
'/\b((?:\p{L}\p{M}*)+)\b(?!([^<]+)?>)/u',
"<mark>$1</mark>",
$input
);
Outputs:
<mark>lorem</mark> <a href="#"><mark>foo</mark></a> <mark>ação</mark>
So far so good, right? Let's add that "è" and see.
$input = 'lorem <a href="#">foo</a> ação evè';
Nets the output:
<mark>lorem</mark> <a href="#"><mark>foo</mark></a> <mark>ação</mark> <mark>eve</mark>̀
That ain't right. Turns out the word boundary shorthand \b still acts a bit silly, even in UTF-8 mode, so you have to replace it with some negative lookarounds. While we're at it, let's also use \pL in place of \p{L}, as the curly braces are optional for single-letter Unicode categories.
$input = 'lorem <a href="#">foo</a> ação evè';
echo preg_replace(
'/(?<![\pL\pM])((?:\pL\pM*)+)(?![\pL\pM])(?!([^<]+)?>)/u',
"<mark>$1</mark>",
$input
);
Outputs:
<mark>lorem</mark> <a href="#"><mark>foo</mark></a> <mark>ação</mark> <mark>evè</mark>
Demo at https://eval.in/194139.
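If you want to reuse this, here is a minimal sketch that wraps the lookaround-based pattern in a function. The name markWords is just an illustration, not something from the question, and the "\u{...}" escapes assume PHP 7 or later. It runs both spellings of "è" (single code point, and "e" plus combining mark) through the same pattern so you can compare them side by side.

```php
<?php
// Hypothetical helper (the name is an assumption); wraps the final pattern above.
function markWords(string $html): string
{
    // A "word" is a run of letters, each optionally followed by combining marks,
    // not preceded or followed by another letter/mark, and not inside a tag.
    $pattern = '/(?<![\pL\pM])((?:\pL\pM*)+)(?![\pL\pM])(?!([^<]+)?>)/u';
    $result = preg_replace($pattern, '<mark>$1</mark>', $html);

    // preg_replace() returns null on error (e.g. invalid UTF-8 with the u modifier);
    // in that case fall back to the untouched input.
    return $result ?? $html;
}

$precomposed = "ev\u{00E8}";  // "è" as the single code point U+00E8
$decomposed  = "eve\u{0300}"; // "e" followed by the combining grave accent U+0300

echo markWords($precomposed), "\n";
echo markWords($decomposed), "\n";
echo markWords('<a href="#" title="sit">sit</a> ação'), "\n";
```

Both spellings should come back wrapped as a single word, and the tag name and attribute values in the last line should be left alone. Keep in mind this is still a regex over HTML, so it relies on the same assumption as the original pattern: no stray ">" in the text content.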