php, regex, multibyte-functions

Regex to convert any word of 3 characters or fewer to wordVVV


I am trying to convert any occurrence of a word with 3 chars or less to the same word with the string VVV attached to it.
Example: for -> forVVV
I am using non-Latin characters (UTF-8), hence the mb_ functions.
What I have is:

$pattern='\b[.{1,6}]\b';
$text=mb_ereg_replace($pattern,'\0VVV',$text,'me');

What am I missing?

Here is a test case. As you can see, it catches nothing:

$text="א אב אבי אביהו מדינה שול של";
$pattern='/\b.{1,6}\b/um';
$text=preg_replace($pattern,'hhh',$text);
echo $text;

Solution

  • Your pattern isn't detecting or grouping things correctly.

    Square brackets define a character class, so [.{1,6}] matches a single character out of the set ".", "{", "1", ",", "6" and "}". Use \w for word characters and standard parentheses instead of square brackets. You're also not evaluating PHP code in the replacement, only referring to captured text segments, so you don't need the e flag:

    $pattern = '\b(\w{1,3})\b';
    $text = mb_ereg_replace($pattern, '\0VVV', $text, 'm');

    Alternatively, use preg_replace with the Unicode (u) flag:

    $text = preg_replace('/\b\w{1,3}\b/um', '\0VVV', $text);

    If you need to cater for Arabic and other right-to-left scripts, you need to use Unicode character properties instead of \w and \b. \w doesn't match letters from all languages, and \b only matches between \w\W and \W\w, so both can misbehave on non-Latin text when the engine treats them as ASCII-only.

    Try this instead:

    $text = preg_replace('/(?<!\pL)(\pL{1,3})(?!\pL)/um', '\1VVV', $text);

    The two lookarounds require that no letter sits directly before or after the run of one to three letters, so nothing outside the word is consumed.
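
To see how the \w approach and the letter-property approach differ in practice, here is a minimal sketch that runs both on sample input. The function names tagShortWords and tagShortLetterRuns are hypothetical, made up for this illustration, and the second pattern is one possible variant that uses a lookbehind and a lookahead on \pL so that nothing around the word is consumed. Treat it as a starting point under those assumptions rather than a definitive implementation.

```php
<?php
// Hypothetical helper names, introduced only for this illustration.

// Variant 1: \w and \b with the u flag. Note that \w also accepts
// digits and underscores, so "10" counts as a word here.
function tagShortWords(string $text): string
{
    // preg_replace returns null on failure (e.g. invalid UTF-8),
    // in which case the input is returned unchanged.
    return preg_replace('/\b\w{1,3}\b/u', '\0VVV', $text) ?? $text;
}

// Variant 2: Unicode letter property with lookarounds. A run of one to
// three letters is matched only if no letter sits directly before or
// after it.
function tagShortLetterRuns(string $text): string
{
    return preg_replace('/(?<!\pL)\pL{1,3}(?!\pL)/u', '\0VVV', $text) ?? $text;
}

$english = 'for a word of 10 chars';
// The sample words from the question, joined with single spaces.
$hebrew = implode(' ', ['א', 'אב', 'אבי', 'אביהו', 'מדינה', 'שול', 'של']);

echo tagShortWords($english), PHP_EOL;      // forVVV aVVV word ofVVV 10VVV chars
echo tagShortLetterRuns($english), PHP_EOL; // forVVV aVVV word ofVVV 10 chars
echo tagShortLetterRuns($hebrew), PHP_EOL;
```

On the Hebrew sample, the second function should tag the one-, two- and three-letter words and leave the two five-letter words untouched. The English output shows the practical difference between the variants: \w treats "10" as a short word, while \pL only looks at letters.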