Tags: debugging, character-encoding, char, non-ascii-characters, file-conversion

Character set conversion problem - debug invalid characters - reverse engineer earlier conversions


I have a character conversion problem: a few strings are incorrectly encoded or decoded. The strings came from a CSV file in ASCII format.

The current strings I have are:

N‚met
Tet‹

I know that:

"‚" (0x82) should originally be "é" (e with acute accent)
"‹" (0x8B) should originally be "ő" (o with double acute accent)

How can I debug and reverse engineer which conversions were applied to the original characters to produce the current ones?

I suspect that several rounds of decoding and encoding happened, but I was not able to reproduce the original characters.


Solution

  • Here is an expanded version of my comment as an answer:

    Your viewer uses CP1252 (English and Western Europe, also called ANSI on Windows), CP1250 (Eastern Europe), or another similar code page. Most characters are encoded the same way in these code pages, with only a few language-specific differences. Your example does not include any character that differs between the two encodings, so I cannot say precisely which one it is.

    These code pages are used on Microsoft Windows and are based on Latin-1 (but not 100% compatible with it), so it is common to see text interpreted with such an encoding. macOS and Linux now mostly use UTF-8. Windows uses Unicode internally, but as UTF-16.

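    The old encoding is probably one of the DOS code pages, which were also used frequently for CSV files: CP437 (the standard code page in DOS), CP850 (Western Europe), or CP852 (Central Europe). All three map byte 0x82 to "é", but only CP852 maps 0x8B to "ő" (CP437 and CP850 map it to "ï"), so CP852 is the most likely candidate for your file.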
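
    The guesses above can be checked by brute force: encode the garbled string back to bytes with each encoding the viewer may have used, then decode those bytes with each candidate original encoding and compare with the expected text. The following is a minimal sketch, assuming the garbled characters are U+201A and U+2039 (what a CP1252 or CP1250 viewer shows for bytes 0x82 and 0x8B). The candidate lists and the helper name `find_conversions` are illustrative assumptions.

```python
# Encodings the viewer may have used to (wrongly) decode the bytes.
VIEWER_GUESSES = ["cp1252", "cp1250", "latin-1"]
# Encodings the file may originally have been written in.
SOURCE_GUESSES = ["cp437", "cp850", "cp852"]


def find_conversions(garbled, expected):
    """Return (viewer, source) pairs that turn `garbled` back into `expected`."""
    matches = []
    for viewer in VIEWER_GUESSES:
        try:
            raw = garbled.encode(viewer)  # undo the wrong decoding
        except UnicodeEncodeError:
            continue  # this viewer encoding cannot produce these characters
        for source in SOURCE_GUESSES:
            try:
                repaired = raw.decode(source)  # redo it with the candidate
            except UnicodeDecodeError:
                continue
            if repaired == expected:
                matches.append((viewer, source))
    return matches


# Garbled sample -> text it should have been (written with escapes so the
# script itself does not depend on the editor's encoding).
samples = {
    "N\u201amet": "N\u00e9met",  # "N‚met" -> "Német"
    "Tet\u2039": "Tet\u0151",    # "Tet‹"  -> "Tető"
}

for garbled, expected in samples.items():
    print(repr(garbled), "->", repr(expected), find_conversions(garbled, expected))
```

    With these two samples, the first string is repaired by all three DOS code pages, while the second is repaired only when the source is taken to be CP852. The viewer side stays ambiguous between CP1252 and CP1250, as noted above.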

    As for the other questions you raised in the comments: if you are asking for tools, I think you should go to Super User. Some editors allow you to specify the encoding. You may also open the local file in a browser: browsers let you choose the encoding, and I think you can then copy the text as Unicode (not sure). Other tools sometimes have a hidden option for importing files, but possibly not with every encoding.

    If you want to do it programmatically, ask a new question on this site, but then you need to specify the language. Python is well suited for such conversions (most scripting languages were created to handle text): it has many encodings built in, and you just specify the encoding when reading and when writing the files. R can also be told which input encoding to use.
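
    Specifying the encoding when reading and when writing could look like the sketch below. It assumes the source file is CP852 and the target is UTF-8. The function name `convert_file`, the file names, and the sample bytes are hypothetical and only serve the demonstration.

```python
import os
import tempfile


def convert_file(src_path, dst_path, src_encoding="cp852", dst_encoding="utf-8"):
    """Re-encode a text file; the default encodings are assumptions."""
    # newline="" leaves the CSV's original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
    return text


# Demonstration on a throwaway file holding the two sample strings as raw bytes.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.csv")
    dst = os.path.join(tmp, "output.csv")
    with open(src, "wb") as f:
        f.write(b"N\x82met;Tet\x8b\r\n")
    converted = convert_file(src, dst)
    with open(dst, "rb") as f:
        utf8_bytes = f.read()

print(repr(converted))
```

    Reading the whole file into memory keeps the sketch short. For very large files, copying line by line between the two open files would be the more cautious choice.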