Search code examples
pythonsqlalchemyteradatateradatasql

Encoding problem while querying data from TeradataDB to dataframe in Python


I have this problem that I can't solve. We have table in Teradata database. Let's say that our table has only one row (Świdnica - Polish word) and one column (addresses).

Column 'addresses' is defined as:

addresses VARCHAR(80) CHARACTER SET LATIN NOT CASESPECIFIC

Let's say that we query the data:

select addresses from WH_table

and in Teradata Studio we get nice result:

ŚWIDNICA

but when I try to query using extension to SQLAlchemy named teradatasqlalchemy I get:

'\x8cWIDNICA'

My code looks like this:

import sqlalchemy as sqlal
import pandas as pd

engine = sqlal.create_engine('teradatasql://ip_to_my_db/?user=user&password=pass!&tmode=TERA', pool_size=5)
conn = engine.connect()

try:
    query = """select addresses from WH_table"""
    data = pd.read_sql(query, con = conn)
    print(data)

except Exception as e:
    print(e)

As far as I know teradatasqlalchemy query returns data as UTF-8. I tried encoding and decoding data using UTF8, Latin1, cp1252 and iso-8859-1, but I still can't get it to just return 'ŚWIDNICA'. I could replace \x8c to Ś, but probably as you would expect. It will work for a couple of examples and not for whole table which has more then thousand rows.


Solution

  • Using the answers from comment under the original question I figured out a way to make it work. It's not the most optimized answer, but works for small dataframes.

    import sqlalchemy as sqlal
    import pandas as pd
    
    engine = sqlal.create_engine('teradatasql://ip_to_my_db/?user=user&password=pass!&tmode=TERA', pool_size=5)
    conn = engine.connect()
    
    try:
        query = """select addresses from WH_table"""
        data = pd.read_sql(query, con = conn) # read data from db using query
        for column, series in data.items():
            for item in series:
                # change string to hex
                str_to_hex = ''.join(hex(ord(x))[2:] for x in item) 
                # change hex to bytes and decode using encoding of your choice
                hex_to_word = bytes.fromhex(str_to_hex).decode('cp1250')
                # replace the word in df
                series.replace(item, hex_to_word, inplace = True)
    
    except Exception as e:
        print(e)
    

    It works well for small df, but going through each item in df is not very efficient as you can imagine.