I'm trying to port some Delphi code that sends data to a Universe database. To make the text readable by the DB, we need to encode it in OEM.
In Delphi it is done this way:
procedure TForm1.GenerarTablasNLS;
var
  i: integer;
begin
  for i := 0 to 255 do
  begin
    TablaUV_NLS[i] := AnsiChar(i);
    TablaNLS_UV[i] := AnsiChar(i);
  end;
  // Trailing null terminator
  TablaUV_NLS[256] := #0;
  TablaNLS_UV[256] := #0;
  // Convert both tables in place, starting at index 1
  OemToCharA(@TablaUV_NLS[1], @TablaUV_NLS[1]);
  CharToOemA(@TablaNLS_UV[1], @TablaNLS_UV[1]);
end;
And then we translate our text simply like this:
function StringToUniverse(const Value: string): AnsiString;
var
  p: PChar;
  q: PAnsiChar;
begin
  SetLength(Result, Length(Value));
  if Value = '' then Exit;
  p := Pointer(Value);
  q := Pointer(Result);
  while p^ <> #0 do
  begin
    q^ := TablaNLS_UV[Ord(AnsiChar(p^))];
    Inc(p);
    Inc(q);
  end;
end;
I follow the same logic in Python, using a dictionary that stores each character translation:
class StringUniverseDict(dict):
    def __missing__(self, key):
        return key

TablaString2UV = StringUniverseDict()
def rellenar_tablas_codificacion():
    TablaString2UV['á'] = '\xa0'  # chr(225) = chr(160), non-breaking space
    TablaString2UV['é'] = '‚'  # chr(233) = chr(130)
    TablaString2UV['í'] = '¡'  # chr(237) = chr(161)
    TablaString2UV['ó'] = '¢'  # chr(243) = chr(162)
    TablaString2UV['ú'] = '£'  # chr(250) = chr(163)
    TablaString2UV['ñ'] = '¤'  # chr(241) = chr(164)
    TablaString2UV['ç'] = '‡'  # chr(231) = chr(135)
    TablaString2UV['Á'] = 'µ'  # chr(193) = chr(181)
    TablaString2UV['É'] = chr(144)  # chr(201) = chr(144)
    TablaString2UV['Í'] = 'Ö'  # chr(205) = chr(214)
    TablaString2UV['Ó'] = 'à'  # chr(211) = chr(224)
    TablaString2UV['Ñ'] = '¥'  # chr(209) = chr(165)
    TablaString2UV['Ç'] = '€'  # chr(199) = chr(128)
    TablaString2UV['ü'] = chr(129)  # chr(252) = chr(129)
    TablaString2UV[chr(129)] = '_'  # chr(129) = chr(95)
    TablaString2UV[chr(141)] = '_'  # chr(141) = chr(95)
    TablaString2UV['•'] = chr(7)  # chr(149) = chr(7); chr(007) is a syntax error in Python 3
    TablaString2UV['Å'] = chr(143)  # chr(197) = chr(143)
    TablaString2UV['Ø'] = chr(157)  # chr(216) = chr(157)
    TablaString2UV['ì'] = chr(141)  # chr(236) = chr(141)
This works "fine" as long as I translate using printable characters. For example, the string
"á é í ó ú ñ ç Á Í Ó Ú Ñ Ç"
is translated, in Delphi, to the following bytes:
0xa0 0x20 0x82 0x20 0xa1 0x20 0xa2 0x20 0xa3 0x20 0xa4 0x20 0x87 0x20 0xb5 0x20 0xd6 0x20 0xe0 0x20 0xe9 0x20 0xa5 0x20 0x80 0xfe 0x73 0x64 0x73
(á translates to ' ', which is chr(160) or 0xA0 in hex; é is '‚' or chr(130), 0x82 in hex; í is '¡', chr(161) or 0xA1 in hex, and so on.)
In Python, when I try to encode this to OEM I do the following:
def convertir_string_a_universe(cadena_python):
    resultado = ''
    for letra in cadena_python:
        resultado += TablaString2UV[letra]
    return resultado
And then, to get the bytes:
txt_registro = convertir_string_a_universe(txt_orig)
datos = bytes(txt_registro, 'cp1252')
With this I get the following bytes:
b'\xa0 \x82 \xa1 \xa2 \xa3 \xa4 \x87 \xb5 \xd6 \xe0 \xe9 \xa5 \x80 \x9a'
My problem is that this OEM encoding uses non-printable characters, as in 'É' = chr(144) (0x90 in hex). If I call bytes(txt_registro, 'cp1252') on a string in which I have translated 'É' into chr(0x90), I get this error:
caracteres_mal = 'Éü'
txt_registro = convertir_string_a_universe(txt_orig)
datos = bytes(txt_registro, 'cp1252')
File "C:\Users\Hector\PyCharmProjects\pyuniverse\pyuniverse\UniverseRegister.py", line 138, in reconstruir_registro_universe
datos = bytes(txt_registro, 'cp1252')
File "C:\Users\Hector\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\x90' in position 0: character maps to <undefined>
How can I do this OEM encoding without raising this UnicodeEncodeError?
This is because cp1252 does not know about chr(0x90). If you try with utf-8 instead, it will work, although the character is then encoded as two bytes:
>>> chr(0x90).encode("utf8")
b'\xc2\x90'
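To see which byte values are affected, a small sketch like the following can list every byte that Python's cp1252 codec leaves unmapped. It only uses the standard library; the variable name `undefined` is just for illustration.

```python
# Collect the byte values that Python's cp1252 codec cannot decode,
# i.e. the positions that have no character assigned in that code page.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(value) for value in undefined])
```

The values reported should be 0x81, 0x8d, 0x8f, 0x90 and 0x9d, which are exactly the chr(129), chr(141), chr(143), chr(144) and chr(157) entries that the table in the question had to write with chr() instead of a printable character.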
I don't understand why you are trying to convert to cp1252, though: you have applied a custom conversion map and then, with bytes(txt_registro, 'cp1252'), you are converting your result again to cp1252.
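As an aside, the mappings listed in the question (á to 0xA0, É to 0x90, Á to 0xB5 and so on) match code page 850. Assuming that is the OEM code page CharToOemA was using on the original machine (an assumption, since the OEM page depends on the Windows configuration), Python's built-in cp850 codec may already produce the expected bytes without a hand-written table:

```python
# Assumption: the target OEM code page is 850 (Western European DOS).
texto = "á é í ó ú ñ ç Á Í Ó Ú Ñ Ç"
datos = texto.encode("cp850")
print(" ".join(f"{b:02x}" for b in datos))

# The characters that cp1252 could not handle encode to a single byte each.
problematicos = "Éü".encode("cp850")
print(problematicos)

# Characters missing from cp850 (such as the bullet) raise by default;
# errors="replace" substitutes "?" instead.
vineta = "•".encode("cp850", errors="replace")
print(vineta)
```

Note that this does not reproduce the "best fit" substitutions that the Windows API applies (for example the bullet becoming chr(7) in the question's table), so those few cases would still need explicit handling.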
I think what you are looking for is something like:
datos = bytes(txt_orig, 'uv')
where uv is your custom codec.
So you would have to write an encoder and a decoder for it (which is basically what you have done already). Take a look at https://docs.python.org/3/library/codecs.html#codecs.register to register a new codec. The function you register with it should return a CodecInfo object, which is described earlier in the documentation.
import codecs

def buscar_a_uv(codec):
    # Search function: return a CodecInfo for "uv", None for any other name
    if codec == "uv":
        return codecs.CodecInfo(
            convertir_string_a_universe, convertir_universe_a_string, name="uv")
    else:
        return None

codecs.register(buscar_a_uv)
datos = bytes(txt_orig, 'uv')
The encoder must return a (bytes, length consumed) tuple and the decoder a (str, length consumed) tuple, so you would need to update convertir_string_a_universe a bit.
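A minimal sketch of what the updated functions could look like follows. The names `UV_BYTES`, `UV_CHARS`, `uv_encode`, `uv_decode` and `buscar_uv` are hypothetical, and the table is only a small subset of the one in the question, stored as byte values rather than characters so that no second encoding step is needed.

```python
import codecs

# Hypothetical subset of the question's table, stored as byte values.
UV_BYTES = {"á": 0xA0, "é": 0x82, "É": 0x90, "ü": 0x81, "•": 0x07}
UV_CHARS = {value: char for char, value in UV_BYTES.items()}


def uv_encode(text, errors="strict"):
    """Return (bytes, characters consumed); `errors` is ignored in this sketch."""
    out = bytearray()
    for pos, char in enumerate(text):
        value = UV_BYTES.get(char, ord(char))  # unmapped characters pass through
        if value > 0xFF:
            raise UnicodeEncodeError("uv", text, pos, pos + 1, "no single-byte mapping")
        out.append(value)
    return bytes(out), len(text)


def uv_decode(data, errors="strict"):
    """Return (str, bytes consumed); `data` may arrive as a memoryview."""
    raw = bytes(data)
    return "".join(UV_CHARS.get(value, chr(value)) for value in raw), len(raw)


def buscar_uv(name):
    # Search function: only answer for the "uv" codec name.
    if name == "uv":
        return codecs.CodecInfo(uv_encode, uv_decode, name="uv")
    return None


codecs.register(buscar_uv)

datos = bytes("Éü", "uv")
print(datos)
print(datos.decode("uv"))
```

Working with integers avoids the round trip through cp1252 entirely, which is what triggered the UnicodeEncodeError. A full version would need every entry of the original table and a decision on what to do with characters that have no mapping.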