php unicode encoding character-encoding unicode-normalization

PHP: convert non-standard marks and special characters to normal


Is there a way to convert characters like:

É É é à Ç etc

and also this type of exclamation mark with a space after it built in:

！

to their normal versions? At the moment I have code like this:

$linesvalue = str_replace(["Ç","ç"],"ç",$linesvalue);
$linesvalue = str_replace(["É","É","é"],"é",$linesvalue);
$linesvalue = str_replace("è","è",$linesvalue);
$linesvalue = str_replace("à","à",$linesvalue);
$linesvalue = str_replace("â","â",$linesvalue);
$linesvalue = str_replace("ê","ê",$linesvalue);

They look like they're replacing each character with itself, but they're certainly not. Anyway, this is not too bad, but I find that when I try to replace the exclamation mark (！) in particular, it also seems to replace some accented characters such as ü.

Is there a way to convert the whole text in advance so it's all just standard characters?


Solution

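  • Use Unicode normalization form C (NFC) to combine a base letter and its combining marks, such as accents, into a single precomposed character. Form KC (NFKC) additionally converts compatibility characters, such as the full-width exclamation mark U+FF01, to their standard versions. In PHP this is done with the Normalizer class from the intl extension.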

    Example:

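    <?php
    // Requires the intl extension, which provides the Normalizer class.
    // Escapes (PHP 7+) make the input unambiguous: a precomposed É, then
    // E, e, a, C each followed by a combining mark, then U+FF01.
    $string = "\u{C9} E\u{301} e\u{301} a\u{300} C\u{327} \u{FF01}";
    print "before: $string\n";
    print "hex: " . unpack("H*", $string)[1] . "\n";
    // NFKC composes combining sequences and maps U+FF01 to ASCII "!".
    $string = Normalizer::normalize($string, Normalizer::FORM_KC);
    print "after: $string\n";
    print "hex: " . unpack("H*", $string)[1] . "\n";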
    

    Output:

    before: É É é à Ç ！
    hex: c3892045cc812065cc812061cc802043cca720efbc81
    after: É É é à Ç !
    hex: c38920c38920c3a920c3a020c3872021
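
To normalize the whole text in advance, as the question asks, the call above could be wrapped in a small helper. This is only a sketch: the function name `normalize_text` and the choice to throw exceptions are assumptions for illustration, not part of the original answer. It relies on `Normalizer::normalize()` returning `false` on failure, for example when the input is not valid UTF-8.

```php
<?php
// Hypothetical helper (name and error handling are illustrative assumptions).
// Normalizes text to NFKC so that combining sequences and full-width
// characters become their standard precomposed/ASCII equivalents.
function normalize_text(string $text): string
{
    // Normalizer is provided by the intl extension.
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('The intl extension is required.');
    }

    $normalized = Normalizer::normalize($text, Normalizer::FORM_KC);

    // normalize() returns false on error, e.g. for invalid UTF-8 input.
    if ($normalized === false) {
        throw new InvalidArgumentException('Input could not be normalized.');
    }

    return $normalized;
}

// "E" + combining acute accent, then a full-width exclamation mark.
$input = "E\u{301}cole \u{FF01}";
$clean = normalize_text($input);

print bin2hex($clean) . "\n"; // c389636f6c652021
```

Running every incoming line through such a helper once, before any comparison or replacement, would remove the need for the per-character `str_replace` calls in the question.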