For a project I am working on, I have many .unl files (Informix unload files) for different countries, and these need to be imported into Postgres. To do that, I need to translate the Informix schema to a Postgres schema using Python.
Assume my Python script opens each .unl file with this line of code:
open(file, 'r', encoding='latin1')
For countries that use encoding = latin1, the script works fine and the data looks good in Postgres. Poland is the exception.
When I specify encoding = latin2 for Poland, the import script still executes successfully, but the Polish text ends up looking different in Postgres. For example, the output unexpectedly looks like this:
But if the encoding were correct, the expected result should look like this:
I have tried but still can't figure out how to fix this. I would appreciate any suggestions on how to solve this problem. Thank you in advance!
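One way to diagnose an encoding problem like this is to look at the raw bytes of the file before choosing an encoding. The sketch below writes a small sample file first so it is runnable; the filename `sample.unl` and its content are placeholders, not the asker's actual data.

```python
# Create a hypothetical sample file so the check below is runnable.
# In practice you would skip this step and open your real .unl file.
sample = 'Aleksańdra Świętochowskiego'.encode('utf-8')
with open('sample.unl', 'wb') as f:
    f.write(sample)

# Open in binary mode and inspect the raw bytes.
with open('sample.unl', 'rb') as f:
    raw = f.read(40)
print(raw)

# UTF-8 encodes ń as the two-byte sequence b'\xc5\x84',
# whereas ISO-8859-2 (latin2) would use a single byte for it.
# Seeing multi-byte sequences like b'\xc5\x84' suggests the
# file is UTF-8, not latin2.
```

If the bytes show two-byte sequences starting with `\xc4` or `\xc5` for Polish letters, the file is almost certainly UTF-8 rather than Latin-2.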
You are facing a classic case of mojibake: the files are actually UTF-8 encoded, but they are being decoded as Latin-2.
The following (commented) snippet demonstrates it: type .\SO\78540135.py

file = r'.\SO\78540135.txt'
str_text = 'Aleksańdra Świętochowskiego'

# create a sample file: UTF-8 encoded
with open(file, 'w', encoding='utf-8') as f:
    f.write(str_text)

# read the file using the wrong encoding
with open(file, 'r', encoding='latin2') as f:
    str_name = f.read()
print('\nmojibake', str_name)

# read the file using the correct encoding
with open(file, 'r', encoding='utf-8') as f:
    str_name = f.read()
print('\nUTF8text', str_name)
Output: python .\SO\78540135.py

mojibake AleksaĹdra ĹwiÄtochowskiego

UTF8text Aleksańdra Świętochowskiego
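Because Python's latin2 codec maps every byte to some character, the wrong decoding is reversible: re-encoding the mojibake string as latin2 recovers the original bytes, which can then be decoded as UTF-8. This is a minimal sketch of that round-trip repair, assuming the damaged text only passed through one wrong latin2 decode:

```python
original = 'Aleksańdra Świętochowskiego'

# Simulate the bug: UTF-8 bytes decoded with the wrong codec.
mojibake = original.encode('utf-8').decode('latin2')

# Repair: re-encode with the wrong codec to get the original
# UTF-8 bytes back, then decode them correctly.
repaired = mojibake.encode('latin2').decode('utf-8')

print(repaired == original)  # True
```

The real fix, of course, is simply to open the Polish .unl files with encoding='utf-8' in the first place; the round-trip above is only useful for rescuing data that was already imported with the wrong encoding.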