php encoding utf-8 ansi

utf8_encode() does not properly convert some non-English/diacritic characters


I have a very weird situation. My CSV file contains the following text, and Notepad++ shows the file's encoding as ANSI.

Œœ Ÿ 654123 áÁàÀâÂäÄãÃåÅæÆçÇéÉèÈêÊëËíÍìÌîÎïÏñÑóÓòÒôÔöÖõÕŒœúÚùÙûÛüÜÿŸ

And the following is my sample code:

<?php
header('Content-Type: text/html; charset=UTF-8');

$handle = fopen("unicode.csv", "r");


while (($line = fgets($handle)) !== FALSE)
{
    $cur_encoding = mb_detect_encoding($line) ; 
    if($cur_encoding == "UTF-8" && mb_check_encoding($line,"UTF-8")) 
    {
        echo "\r\n UTF-8".$line; 
    }
    else 
    {
        echo "\r\n encode UTF-8".utf8_encode($line); 
    }
}?>

Issues I have found with the code:

  1. It is not able to detect the encoding.
  2. Two characters are missing (Œœ and Ÿ).

Please help me find out why these two characters are missing. Another strange behaviour is that the characters show up in Chrome but not in Firefox or IE.

Note: I am able to read the file successfully if I convert its encoding to UTF-8 using Notepad++, so please do not suggest that solution.

Get the CSV file here.


Solution

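  • That file is encoded in code page 1252, a.k.a. MS-ANSI, a.k.a. WINDOWS-1252, a.k.a. Windows Latin 1. utf8_encode() only converts from ISO-8859-1, where the bytes that Windows-1252 uses for Œ, œ and Ÿ (0x8C, 0x9C and 0x9F) are control characters, which is why those characters go missing. To convert the file to UTF-8: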

    echo iconv('CP1252', 'UTF-8', file_get_contents('unicode.csv'));
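
If you want to keep the line-by-line loop from the question, something like the following might work. It is only a sketch: `to_utf8()` and `print_file_as_utf8()` are hypothetical helper names, not anything from the original post, and it assumes that every line which is not already valid UTF-8 is Windows-1252. That holds for the file in the question, but a Windows-1252 byte sequence can occasionally also be valid UTF-8, so treat the check as a heuristic.

```php
<?php
// Sketch only: to_utf8() and print_file_as_utf8() are hypothetical helpers.

// Return $line as UTF-8. Assumption: anything that is not already
// valid UTF-8 is Windows-1252 (true for the file in the question).
function to_utf8(string $line): string
{
    if (mb_check_encoding($line, 'UTF-8')) {
        return $line; // already valid UTF-8 (plain ASCII included)
    }
    // Convert from Windows-1252, not ISO-8859-1, so that bytes such as
    // 0x8C, 0x9C and 0x9F become the characters Œ, œ and Ÿ.
    return mb_convert_encoding($line, 'UTF-8', 'Windows-1252');
}

// Read a file line by line and echo each line as UTF-8.
function print_file_as_utf8(string $path): void
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return; // file could not be opened
    }
    while (($line = fgets($handle)) !== false) {
        echo to_utf8($line);
    }
    fclose($handle);
}
```

Calling `print_file_as_utf8('unicode.csv');` after sending the `Content-Type: text/html; charset=UTF-8` header should then print the same text as the one-liner above, without relying on mb_detect_encoding() or utf8_encode().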