php utf-8 trim

PHP trim non-letters Unicode


I need to trim all characters except letters (from any language) from a UTF-8 string. An early test worked fine until I started using non-Latin UTF-8 letters:

<?php
$s = '\$5ı龢abc';
echo '<p>'.$s.'</p>';
while (!preg_match('/([\p{L}]+)/u', $s[0]))
{
 $s = substr($s, 1);
 echo '<p>'.$s.'</p>';
}
?>

This currently outputs the following:

\$5ı龢abc

$5ı龢abc

5ı龢abc

ı龢abc

�龢abc

龢abc

��abc

�abc

abc

I would like the final output to be: ı龢abc. What am I missing?


Solution

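  • Indexing individual characters with $s[0] doesn't work, since PHP isn't aware of "characters" in strings and merely indexes bytes. With a multi-byte character such as ı, $s[0] is only its first byte, which is not valid UTF-8 on its own, so preg_match() with the /u modifier fails and returns false. The negated condition then stays true and the loop keeps cutting bytes until it reaches an ASCII letter. The loop is unnecessary anyway; just remove all non-letter characters at the beginning of the string: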

    $s = preg_replace('/^\P{L}*/u', '', $s);
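
As a rough sketch of why the byte-indexed loop overshoots, and of one possible way to trim both ends of the string, something like the following may help. It assumes the source file is saved as UTF-8 and PHP 7 or later. `trimNonLetters` is a hypothetical helper name, and stripping trailing non-letters is an assumption that goes beyond what the question asks for.

```php
<?php
// Assumes this file is saved as UTF-8.
$s = '\$5ı龢abc';

// strlen() counts bytes: "ı" takes 2 bytes and "龢" takes 3 in UTF-8.
$byteCount = strlen($s);                    // 11
// With the /u modifier, "." matches one whole code point.
$charCount = preg_match_all('/./su', $s);   // 8

// $s[3] is only the first byte of "ı" (0xC4). A lone lead byte is not valid
// UTF-8, so preg_match() fails and returns false rather than 0 or 1.
// !false is true, which is why the original loop keeps cutting bytes.
$firstByteOfDotlessI = $s[3];
$matchResult = preg_match('/\p{L}/u', $firstByteOfDotlessI);  // false

/**
 * Hypothetical helper: strip non-letters from both ends of a UTF-8 string.
 */
function trimNonLetters(string $s): string
{
    $trimmed = preg_replace('/^\P{L}+|\P{L}+$/u', '', $s);
    // preg_replace() returns null on failure (for example, invalid UTF-8 input);
    // fall back to the unchanged input in that case.
    return $trimmed ?? $s;
}

echo trimNonLetters($s), "\n";            // ı龢abc
echo trimNonLetters('12ı龢abc!?'), "\n";  // ı龢abc
```

Letters in the middle of the string are left alone, and so are any non-letters between them, so '5a1b' would become 'a1b'. If only the leading characters should go, the single preg_replace() call above is enough.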