
How to convert Unicode NCR form to its original form in PHP?


To avoid "monster characters" (garbled text), I chose to store non-English characters in the database (MySQL) in Unicode NCR (numeric character reference) form. However, the PDF library I use (FPDF) does not treat NCRs as encoded characters; it prints the stored data literally, like this:

&#36889;&#20491;&#19968;&#20491;&#20363;&#23376;

but I want it to display like:

這個一個例子

Is there any method to convert Unicode NCR form to its original form?

p.s. the meaning of the sentence is "this is an example" in Traditional Chinese.

p.s. I know NCR form wastes storage space, but I believe it is the safest way to store non-English characters. Correct me if I am wrong. Thanks.


Solution

  • There is a simpler solution, using the PHP mbstring extension.

    // convert any Decimal NCRs to Unicode characters
    $string = "&#36889;&#20491;&#19968;&#20491;&#20363;&#23376;";
    $output = preg_replace_callback(
      '/(&#[0-9]+;)/u', 
      function($m){
        return utf8_entity_decode($m[1]);
      }, 
      $string
    );
    echo $output; // 這個一個例子
    
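    // callback function for the regex
    function utf8_entity_decode($entity){
      // convmap: start code point, end code point, offset, mask;
      // this range covers every Unicode code point (U+0000 to U+10FFFF)
      $convmap = array(0x0, 0x10FFFF, 0, 0x1FFFFF);
      return mb_decode_numericentity($entity, $convmap, 'UTF-8');
    }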
    

    The 'utf8_entity_decode' function is from PHP.net (Andrew Simpson): http://php.net/manual/ru/function.mb-decode-numericentity.php#48085. I modified the code slightly to avoid the deprecated 'e'-modifier within the Regex.
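
As a rough sketch of a shorter variant: mb_decode_numericentity() accepts a whole string, so the regex callback may not be needed when every decimal NCR in the text should be decoded. The helper name decode_ncr and the variable $stored are made up for illustration, and the mbstring extension is assumed to be loaded. html_entity_decode() is shown as a possible alternative; it also resolves hexadecimal references and named entities.

```php
<?php
// Hypothetical helper (not from the original answer): decode all numeric
// character references in a string in a single pass.
function decode_ncr(string $text): string
{
    // convmap: start code point, end code point, offset, mask.
    // Covers U+0000 through U+10FFFF with no offset.
    $convmap = [0x0, 0x10FFFF, 0, 0x1FFFFF];
    return mb_decode_numericentity($text, $convmap, 'UTF-8');
}

// Example value as it might be stored in the database (assumed).
$stored = '&#36889;&#20491;&#19968;&#20491;&#20363;&#23376;';

// Variant 1: mbstring, no regex needed.
echo decode_ncr($stored), PHP_EOL;

// Variant 2: core PHP function; also decodes hex NCRs and named entities.
echo html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8'), PHP_EOL;
```

Both calls also decode references to ASCII characters (for example, `&#60;` becomes `<`), so the result should be escaped again before it is written into HTML.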