Converting widechars to system ANSI encoding in Python

I am trying to make my screen reader work better with Becky! Internet Mail. The problem is with its list view. The control is not Unicode aware, but its items are custom drawn, so on screen the content of every field looks fine regardless of encoding. When the list is accessed via MSAA or UIA, however, plain ANSI text and mails encoded in the code page set for non-Unicode programs come through correctly, whereas mails encoded in Unicode do not. Some sample text:

Zażółć gęślą jaźń

is represented by:

ZaĹĽĂłĹ‚Ä‡ gÄ™Ĺ›lÄ… jaĹşĹ„

In this case it is damaged CP1250, as per the answer below. However:

⚠️ is represented by: âš ď¸Ź

⏰ is represented by: âŹ°

高生旺 is represented by: é«ç”źć—ş

I had assumed that these strings were damaged beyond repair; however, when the beta Unicode support in Windows 10 is enabled, they are exposed correctly.

Is it possible to simulate this behavior in Python?

The solution needs to work in both Python 2 and 3.

At the moment I am simply replacing known combinations of these characters with their proper representations, but this is not a very good solution, because the lists of replacements and of characters to replace need to be updated with each newly discovered character.

Solution

Your UTF-8 text is being decoded as cp1250.

What I did in Python 3 is this:

orig = "Zażółć gęślą jaźń"
wrong = "ZaĹĽĂłĹ‚Ä‡ gÄ™Ĺ›lÄ… jaĹşĹ„"

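# Try every numbered code page and see which one reproduces the damage
for enc in range(437, 1300):
    try:
        res = orig.encode('utf-8').decode(f"cp{enc}")
        if res == wrong:
            print('FOUND', res, enc)
    except (LookupError, UnicodeDecodeError):
        # no such code page, or it cannot decode one of the bytes
        pass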

...and the result was the 1250 codepage.
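The same check can be wrapped in a small helper that tests a named list of candidate encodings. This is only a sketch for Python 3: the function name `find_codepages` and the candidate list are made up for illustration, and the damaged string is simulated here instead of being read from the list view.

```python
def find_codepages(original, damaged, candidates):
    """Return the candidate encodings that turn UTF-8 `original` into `damaged`."""
    matches = []
    raw = original.encode("utf-8")
    for name in candidates:
        try:
            if raw.decode(name) == damaged:
                matches.append(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name, or a byte this code page cannot decode.
            continue
    return matches


original = "Zażółć gęślą jaźń"
# Simulate the damage: UTF-8 bytes misread as cp1250.
damaged = original.encode("utf-8").decode("cp1250")
candidates = ["cp1250", "cp1251", "cp1252", "latin-1", "iso8859-2"]
print(find_codepages(original, damaged, candidates))
```

Catching only `LookupError` and `UnicodeDecodeError` keeps unrelated bugs visible instead of silently swallowing them.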

So your solution would be:

import sys

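def restore(garbaged):
    # Python 3: turn the damaged text back into bytes, then decode them as UTF-8
    if sys.version_info.major > 2:
        return garbaged.encode('cp1250').decode('utf-8')
    # Python 2: the result is a UTF-8 encoded byte string (str)
    else:
        # a byte string (str) holding the damaged text encoded as UTF-8
        try:
            return garbaged.decode('utf-8').encode('cp1250')
        # a unicode object: the implicit ASCII encode inside decode() fails
        except UnicodeEncodeError:
            return garbaged.encode('cp1250')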
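Since the question says that plain ANSI text and mails in the system code page already come through correctly, the repair may raise on text that was never damaged. A minimal sketch of a guard for that case, written for Python 3; the name `restore_or_keep` is hypothetical and cp1250 is assumed to be the system code page:

```python
def restore_or_keep(text, codepage="cp1250"):
    """Undo UTF-8-read-as-codepage damage, or return the text unchanged."""
    try:
        return text.encode(codepage).decode("utf-8")
    except UnicodeError:
        # Covers UnicodeEncodeError (a character outside the code page) and
        # UnicodeDecodeError (the bytes are not valid UTF-8): not repairable
        # this way, so leave the text alone.
        return text


damaged = "Zażółć gęślą jaźń".encode("utf-8").decode("cp1250")
print(restore_or_keep(damaged))              # repaired
print(restore_or_keep("Zażółć gęślą jaźń"))  # already correct, kept as is
```

Short strings can, in principle, be valid UTF-8 by coincidence after re-encoding, so a guard like this reduces false repairs but does not rule them out.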

EDIT:

The reason why "高生旺" cannot be recovered from é«ç”źć—ş:

"高生旺".encode('utf-8') is b'\xe9\xab\x98\xe7\x94\x9f\xe6\x97\xba'.

The problem is the \x98 byte. cp1250 has no character assigned to that value. If you try this:

"高生旺".encode('utf-8').decode('cp1250')

You will get this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 2: character maps to <undefined>

The way to get "é«ç”źć—ş" is:

"高生旺".encode('utf-8').decode('cp1250', 'ignore')

But the 'ignore' argument is the critical part, because it causes data loss:

'é«ç”źć—ş'.encode('cp1250') is b'\xe9\xab\xe7\x94\x9f\xe6\x97\xba'.

If you compare these two:

b'\xe9\xab\xe7\x94\x9f\xe6\x97\xba'
b'\xe9\xab\x98\xe7\x94\x9f\xe6\x97\xba'

you will see that the \x98 byte is missing, so when you try to restore the original content, you get UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte.

If you try this:

'é«ç”źć—ş'.encode('cp1250').decode('utf-8', 'backslashreplace')

The result will be '\\xe9\\xab生旺'. The bytes \xe9\xab\x98 can be decoded to 高, but \xe9\xab alone cannot.
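Recovery is only possible if the undefined bytes survive in some form. The sketch below assumes, without confirmation from the question, that the damaged string keeps each undefined cp1250 byte (0x81, 0x83, 0x88, 0x90, 0x98) as the control character with the same code point, for example U+0098 for \x98. If the text exposed via MSAA or UIA really contains such characters, the round trip can be completed. Both function names are hypothetical and the code is written for Python 3.

```python
# Byte values that cp1250 leaves undefined. ASSUMPTION: they survive in the
# damaged string as control characters with the same code points.
UNDEFINED_CP1250 = (0x81, 0x83, 0x88, 0x90, 0x98)


def damage_keeping_controls(text):
    """Simulate the damage, keeping undefined bytes as control characters."""
    chars = []
    for byte in text.encode("utf-8"):
        if byte in UNDEFINED_CP1250:
            chars.append(chr(byte))
        else:
            chars.append(bytes([byte]).decode("cp1250"))
    return "".join(chars)


def restore_with_controls(damaged):
    """Map every character back to its byte, then decode the bytes as UTF-8."""
    raw = bytearray()
    for ch in damaged:
        if ord(ch) in UNDEFINED_CP1250:
            raw.append(ord(ch))
        else:
            raw.extend(ch.encode("cp1250"))
    return bytes(raw).decode("utf-8")


damaged = damage_keeping_controls("高生旺")
print(repr(damaged))
print(restore_with_controls(damaged))
```

If the control characters have already been dropped, as in the 'ignore' example above, this does not help, because the missing byte cannot be reconstructed.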