I'm getting the exception below when trying to load a file using the TStringList.LoadFromFile
method:
StringList1.LoadFromFile('c:\example.txt');
No mapping for the Unicode character exists in the target multi-byte code page
The file is Unicode, and the error seems to be related to a special character in the file. The example.txt file has only one line, and its content is exactly as shown below:
Ze 🇫🇮
The file contains these bytes:
EF BB BF 5A 65 20 ED A0 BC ED B7 AB ED A0 BC ED B7 AE
Any workarounds?
Your file claims to be encoded as UTF-8, as evidenced by the first 3 bytes EF BB BF, which are the UTF-8 BOM.
In Delphi 2009+, String is a UTF-16 encoded Unicode string, so LoadFromFile() will see the BOM and try to decode the file's bytes from UTF-8 to Unicode, then encode that Unicode data to UTF-16 in memory.
However, after the BOM, the next 3 bytes 5A 65 20 are proper UTF-8, but the rest of your file after that is NOT proper UTF-8. That is why you are getting the exception.
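You can see the same failure without relying on BOM detection by passing the encoding explicitly. A minimal sketch (LoadFromFile() has accepted a TEncoding parameter since Delphi 2009; that the failure surfaces as EEncodingError is an assumption, though it matches the message you quoted):

uses
  System.Classes, System.SysUtils;

var
  SL: TStringList;
begin
  SL := TStringList.Create;
  try
    try
      // With no encoding argument, the BOM selects UTF-8 anyway;
      // forcing UTF-8 explicitly fails the same way, because the
      // bytes after the BOM are not valid UTF-8.
      SL.LoadFromFile('c:\example.txt', TEncoding.UTF8);
    except
      on E: EEncodingError do
        Writeln(E.ClassName, ': ', E.Message);
    end;
  finally
    SL.Free;
  end;
end;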
The correct byte sequence for the characters you have shown should look like the following:
EF BB BF 5A 65 20 F0 9F 87 AB F0 9F 87 AE
But your file contains these bytes instead:
EF BB BF 5A 65 20 ED A0 BC ED B7 AB ED A0 BC ED B7 AE
As you can see, the byte sequence F0 9F 87 AB F0 9F 87 AE in the correct file has been mis-encoded as ED A0 BC ED B7 AB ED A0 BC ED B7 AE in your bad file.
When processed as UTF-8, the good file decodes to the following Unicode codepoint sequence:
U+005A LATIN CAPITAL LETTER Z
U+0065 LATIN SMALL LETTER E
U+0020 SPACE
U+1F1EB REGIONAL INDICATOR SYMBOL LETTER F
U+1F1EE REGIONAL INDICATOR SYMBOL LETTER I
Whereas your bad file decodes to the following sequence instead:
U+005A LATIN CAPITAL LETTER Z
U+0065 LATIN SMALL LETTER E
U+0020 SPACE
U+D83C HIGH SURROGATE - invalid!
U+DDEB LOW SURROGATE - invalid!
U+D83C HIGH SURROGATE - invalid!
U+DDEE LOW SURROGATE - invalid!
Now, it happens that D83C DDEB D83C DDEE is the proper UTF-16 encoded form of the Unicode codepoints U+1F1EB U+1F1EE. This means that your original Unicode text was encoded to UTF-16 first, then the individual UTF-16 code units were incorrectly treated as-is as Unicode codepoints (which they are not) and were then encoded accordingly to UTF-8, thus producing your bad file. This kind of mis-encoding is commonly known as CESU-8.
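To illustrate, here is a hypothetical sketch of the kind of bug that produces such output (not your actual code; BrokenUtf16ToUtf8 is an invented name, and the dynamic-array + operator assumes Delphi XE7+). It UTF-8-encodes each UTF-16 code unit on its own, so a surrogate pair never combines into one 4-byte sequence:

uses
  System.SysUtils;

function BrokenUtf16ToUtf8(const S: string): TBytes;
var
  CU: Char;
  W: Word;
begin
  Result := nil;
  for CU in S do
  begin
    W := Ord(CU); // one UTF-16 code unit, NOT a full codepoint
    if W < $80 then
      Result := Result + [Byte(W)]
    else if W < $800 then
      Result := Result + [Byte($C0 or (W shr 6)),
                          Byte($80 or (W and $3F))]
    else
      // Surrogates ($D800..$DFFF) fall through here and are encoded
      // individually: $D83C becomes ED A0 BC, $DDEB becomes ED B7 AB,
      // and so on.
      Result := Result + [Byte($E0 or (W shr 12)),
                          Byte($80 or ((W shr 6) and $3F)),
                          Byte($80 or (W and $3F))];
  end;
end;

Feeding it the string 'Ze ' + #$D83C#$DDEB + #$D83C#$DDEE produces exactly the bytes after the BOM in your bad file: 5A 65 20 ED A0 BC ED B7 AB ED A0 BC ED B7 AE.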
If this is the only file affected, then you can simply replace its bytes with the bytes shown above. But if this is part of a larger encoding process that is producing badly encoded UTF-8 files that you can't load afterwards, then you need to figure out where that incorrect UTF-16 handling is occurring and fix that issue.
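If you do need to repair such files in code, one workaround is to read the raw bytes, fold each UTF-8-encoded surrogate pair back into a proper 4-byte UTF-8 sequence, and load the repaired bytes. A minimal sketch, assuming Delphi XE7+ (for dynamic-array concatenation) and that the only damage is surrogate pairs of this kind; RepairCesu8 is a hypothetical helper name:

uses
  System.Classes, System.SysUtils, System.IOUtils;

function RepairCesu8(const B: TBytes): TBytes;
var
  I, Hi, Lo, CP: Integer;
begin
  Result := nil;
  I := 0;
  while I < Length(B) do
  begin
    // ED A0..AF xx ED B0..BF xx is a UTF-8-encoded UTF-16 surrogate
    // pair (the continuation bytes are not validated in this sketch)
    if (I + 5 < Length(B)) and
       (B[I] = $ED) and (B[I+1] in [$A0..$AF]) and
       (B[I+3] = $ED) and (B[I+4] in [$B0..$BF]) then
    begin
      // Decode the two 3-byte sequences back to UTF-16 code units
      Hi := $D000 or ((B[I+1] and $3F) shl 6) or (B[I+2] and $3F);
      Lo := $D000 or ((B[I+4] and $3F) shl 6) or (B[I+5] and $3F);
      // Combine the surrogate pair into a single codepoint
      CP := $10000 + ((Hi - $D800) shl 10) + (Lo - $DC00);
      // Re-encode the codepoint as a proper 4-byte UTF-8 sequence
      Result := Result + [Byte($F0 or (CP shr 18)),
                          Byte($80 or ((CP shr 12) and $3F)),
                          Byte($80 or ((CP shr 6) and $3F)),
                          Byte($80 or (CP and $3F))];
      Inc(I, 6);
    end
    else
    begin
      Result := Result + [B[I]];
      Inc(I);
    end;
  end;
end;

var
  SL: TStringList;
  MS: TBytesStream;
begin
  SL := TStringList.Create;
  try
    MS := TBytesStream.Create(RepairCesu8(TFile.ReadAllBytes('c:\example.txt')));
    try
      SL.LoadFromStream(MS); // the BOM is intact, so UTF-8 is detected
    finally
      MS.Free;
    end;
    Writeln(SL[0]); // Ze 🇫🇮
  finally
    SL.Free;
  end;
end;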