Tags: php, character-encoding, special-characters, ansi

How to save thorn with ansi encoding in PHP?


I need to save a text file in ANSI encoding that contains the special character "thorn" (þ), using PHP. When I simply put the thorn in the PHP source, it shows up as "ţ" in the file. I have tried many different approaches without luck and have no idea how to save the thorn correctly. Could you please give me some advice? Thank you.

After iconv('UTF-8', 'Windows-1252', $this->filedata); (mb_convert_encoding() makes no difference either):

þ ==> ţ

utf8_encode("þ") ==> Ăľ

I'm using NetBeans 15 for coding and Notepad++ 8.4.8 for checking the results.

Something is very strange: I have the PHP-generated ANSI text file, where the thorn looks like ţ, and when I copy and paste it into another ANSI text file created with Notepad++, it is inserted simply as a t. When I convert the thorn with Notepad++, it becomes a ?. Maybe a bug in Notepad++?


Solution

  • Assuming your PHP file is saved as UTF-8, the following saves "þ" in Windows-1252 encoding:

    $text = iconv('UTF-8', 'Windows-1252', 'þ');
    file_put_contents('./output.txt', $text);
    

    Your þ will be saved as 0xFE (numeric value: 254).
    Windows-1252 is the same as ISO-8859-1 except for 0x80 to 0x9F.

    If you can inspect a hex dump of the output file, you can verify that the byte FE is there (the character takes only one byte).
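    If no hex editor is at hand, PHP itself can show the bytes. The following is a minimal sketch, assuming the iconv extension is available. The temporary file name is made up for illustration, and the "\u{00FE}" escape (PHP 7+) is used so that the result does not depend on how the source file itself is saved.

    ```php
    <?php
    // "\u{00FE}" always yields the UTF-8 encoding of þ (bytes C3 BE),
    // regardless of the encoding of this source file.
    $utf8 = "\u{00FE}";
    $ansi = iconv('UTF-8', 'Windows-1252', $utf8);

    // bin2hex() shows the raw bytes, independent of how any editor renders them.
    echo bin2hex($utf8), "\n"; // c3be
    echo bin2hex($ansi), "\n"; // fe

    // Hypothetical file name: write the converted text, then read it back
    // to confirm what actually landed on disk.
    $path = sys_get_temp_dir() . '/thorn-check.txt';
    file_put_contents($path, $ansi);
    echo bin2hex(file_get_contents($path)), "\n"; // fe
    ```

    If the last line prints fe, the file is correct, and anything else you see on screen is a matter of how the viewer interprets that byte.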


    However, on Windows, a text file in the so-called "ANSI" (non-Unicode) encoding is interpreted differently depending on your system locale:

    • If the Windows system locale is Romanian (Romania), an "ANSI" text file is loaded like ISO-8859-2 (strictly speaking as Windows-1250, which assigns the same character to this byte), so the 0xFE byte is shown as ţ (not thorn, but "t with a cedilla"). The ISO-8859-2 code page layout contains no "thorn" letter at all.
      In other words, the pre-Unicode encoding for Romanian (and similar languages) does not support the þ character.
    • If the Windows system locale is English (United States), an "ANSI" text file is loaded like ISO-8859-1 (strictly speaking as Windows-1252), so you can see þ even in a non-Unicode program. That encoding, in turn, does not support ţ: in the ISO-8859-1 code page layout, þ sits exactly where ţ sits in ISO-8859-2.

    Other system locales may interpret 0xFE differently depending on which pre-Unicode encoding suited their language.
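    The effect can be reproduced without changing any Windows setting, by decoding the same byte under different legacy encodings. This is only a sketch, assuming the iconv extension knows all four encoding names. decodeByte() is a hypothetical helper, and Windows-1250 is included on the assumption that it is the ANSI code page a Romanian system locale selects.

    ```php
    <?php
    // Hypothetical helper: decode one raw byte string under each legacy
    // encoding and return the results as UTF-8, keyed by encoding name.
    function decodeByte(string $byte, array $encodings): array
    {
        $result = [];
        foreach ($encodings as $encoding) {
            $result[$encoding] = iconv($encoding, 'UTF-8', $byte);
        }
        return $result;
    }

    $byte = "\xFE"; // the single byte stored in the "ANSI" file
    $decoded = decodeByte($byte, ['ISO-8859-1', 'Windows-1252', 'ISO-8859-2', 'Windows-1250']);

    foreach ($decoded as $encoding => $char) {
        // The two Western encodings give þ; the two Central European ones give ţ.
        printf("%-12s 0xFE => %s\n", $encoding, $char);
    }
    ```

    The file never changes; only the table used to map the byte to a character does.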

    To change the system locale (Windows 11): in Settings, go to Time & language > Language & region > Administrative language settings. In the dialog that opens, select the Administrative tab, where you should see "Current language for non-Unicode programs". Click "Change system locale..." (this requires administrative privileges).

    (Note that the system locale may differ from the Windows display language.)
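    To see which code page the current system locale selects, PHP on Windows can report it. This is a sketch under the assumption that the script runs on a Windows build of PHP 7.1 or later, where sapi_windows_cp_get() exists. ansiCodePage() is a hypothetical wrapper that returns null everywhere else.

    ```php
    <?php
    // Hypothetical wrapper: returns the active Windows ANSI code page number,
    // or null when sapi_windows_cp_get() is unavailable (non-Windows builds).
    function ansiCodePage(): ?int
    {
        return function_exists('sapi_windows_cp_get')
            ? sapi_windows_cp_get('ansi')
            : null;
    }

    $cp = ansiCodePage();
    echo $cp === null
        ? "Not a Windows build of PHP\n"
        : "ANSI code page: {$cp}\n";
    ```

    A value of 1252 would correspond to the Western European code page and 1250 to the Central European one.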


    As for unsupported characters, a text editor has to substitute something for them (for example, a character that the encoding does support), because the current encoding has no byte representation for the original data and the text could not be saved otherwise.

    Sometimes an unsupported character is simply replaced with ?, and sometimes with a similar-looking letter (as when ţ was replaced with t). In any case, you can't save or load the letter þ properly unless the encoding supports that character. The same goes for ţ.
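    Whether a given character survives in a given encoding can be tested up front with a round trip. This is a minimal sketch, assuming the mbstring extension is loaded and left at its default settings. isRepresentable() is a hypothetical helper, not a built-in function.

    ```php
    <?php
    // Hypothetical helper: true when $char (given as UTF-8) comes back
    // unchanged after being converted to $encoding and back again.
    function isRepresentable(string $char, string $encoding): bool
    {
        $legacy = mb_convert_encoding($char, $encoding, 'UTF-8');
        return mb_convert_encoding($legacy, 'UTF-8', $encoding) === $char;
    }

    $thorn    = "\u{00FE}"; // þ
    $tCedilla = "\u{0163}"; // ţ

    var_dump(isRepresentable($thorn, 'ISO-8859-1'));    // bool(true)
    var_dump(isRepresentable($tCedilla, 'ISO-8859-1')); // bool(false)
    var_dump(isRepresentable($thorn, 'ISO-8859-2'));    // bool(false)
    var_dump(isRepresentable($tCedilla, 'ISO-8859-2')); // bool(true)

    // With default settings, mbstring substitutes "?" for a character
    // that the target encoding lacks.
    echo mb_convert_encoding($thorn, 'ISO-8859-2', 'UTF-8'), "\n"; // ?
    ```

    A check like this makes the loss visible before the file is written, instead of after it has been opened in an editor.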

    Notepad++ shows the encoding it is currently using in the status bar (bottom right). If it says "ANSI" (and your OS is Windows), the actual encoding depends on the system locale.