I am currently trying to make my screen reader work better with Becky! Internet Mail. The problem which I am facing is related to the list view in there. This control is not Unicode aware but the items are custom drawn on screen so when someone looks at it content of all fields regardless of encoding looks okay. When accessed via MSAA or UIA however basic ANSI chars and mails encoded with the code page set for non Unicode programs have they text correct whereas mails encoded in Unicode do not. Samples of the text :
Zażółć gęślą jaźń
is represented by:
Zażółć gęślÄ… jaźń In this case it is damaged CP1250 as per answer below. However: ⚠️
is represented by: ⚠️
⏰ is represented by: ⏰ and 高生旺 is represented by: é«ç”źć—ş
I've just assumed that these strings are damaged beyond repair, however when unicode beta support in windows 10 is enabled they are exposed correctly.
Is it possible to simulate this behavior in Python?
The solution needs to work in both Python 2 and 3.
At the moment I am simply replacing known combinations of these characters with their proper representations, but it is not very good solution, because lists containing replacements and characters to replace needs to be updated with each new discovered character.
your utf-8 is decoded as cp1250.
What I did in python3 is this:
orig = "Zażółć gęślą jaźń"
wrong = "Zażółć gęślą jaźń"
for enc in range(437, 1300):
try:
res = orig.encode().decode(f"cp{enc}")
if res == wrong:
print('FOUND', res, enc)
except:
pass
...and the result was the 1250 codepage.
So your solution shall be:
import sys
def restore(garbaged):
# python 3
if sys.version_info.major > 2:
return garbaged.encode('cp1250').decode()
# python 2
else:
# is it a string
try:
return garbaged.decode('utf-8').encode('cp1250')
# or is it unicode
except UnicodeEncodeError:
return garbaged.encode('cp1250')
EDIT:
The reason why "高生旺"
can not be recovered from é«ç”źć—ş
:
"高生旺".encode('utf-8')
is b'\xe9\xab\x98\xe7\x94\x9f\xe6\x97\xba'
.
The problem is the \x98
part. In cp1250 there is no character set for that value. If you try this:
"高生旺".encode('utf-8').decode('cp1250')
You will get this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 2: character maps to <undefined>
The way to get "é«ç”źć—ş"
is:
"高生旺".encode('utf-8').decode('cp1250', 'ignore')
But the ignore
part is critical, it causes data loss:
'é«ç”źć—ş'.encode('cp1250')
is b'\xe9\xab\xe7\x94\x9f\xe6\x97\xba'
.
If you compare these two:
b'\xe9\xab\xe7\x94\x9f\xe6\x97\xba'
b'\xe9\xab\x98\xe7\x94\x9f\xe6\x97\xba'
you will see that the \x98
character is missing so when you try to restore the original content, you will get a UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
.
If you try this:
'é«ç”źć—ş'.encode('cp1250').decode('utf-8', 'backslashreplace')
The result will be '\\xe9\\xab生旺'
. \xe9\xab\x98
could be decoded to 高
, from \xe9\xab
it is not possible.