I have to download information from a PostgreSQL server (that we don't have any control over) to CSV for some non-critical analysis (basically we're looking for tables containing a specific string in any row or column), so I decided to use Pandas read_sql_table to do this, but on some tables I keep getting:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 7: invalid start byte
After researching other SO questions, I changed the client encoding to UTF8, but the error still happens. The server encoding is SQL_ASCII.
An oversimplified version of my script would look like this:
import pandas
from sqlalchemy import create_engine

ENCODING = 'utf8'
conn_str = f"postgresql+psycopg2://{config['DBUSER']}:{config['DBPASS']}@{config['DBHOST']}/{config['DBNAME']}"
engine = create_engine(conn_str, client_encoding=ENCODING, pool_recycle=36000)
conn = engine.connect()

# Sanity check: confirm what the server and client are actually using
server = conn.execute("SHOW SERVER_ENCODING").fetchone()
print("Server Encoding ", server.server_encoding)
client = conn.execute("SHOW CLIENT_ENCODING").fetchone()
print("Client Encoding ", client.client_encoding)

df = pandas.read_sql_table(VIEWNAME, conn, SCHEMA)
Outputs:
Server Encoding SQL_ASCII
Client Encoding UNICODE
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 7: invalid start byte
As far as I can tell, the issue is with the underlying SQLAlchemy connection, so I would like to solve it at the connection level. If that's not possible, I could get away with downloading all the non-problematic rows in the table, but there doesn't seem to be any support for something like that; the closest workaround I've come up with is sketched below.
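For reference, this is the rough fallback I have in mind (plain psycopg2, same config/VIEWNAME/SCHEMA placeholders as above; LATIN1 assigns a character to every byte value, so the decode itself can never fail and no rows are lost):

import pandas
import psycopg2

# Connect with psycopg2 directly so the client encoding can be forced.
raw = psycopg2.connect(
    host=config['DBHOST'], dbname=config['DBNAME'],
    user=config['DBUSER'], password=config['DBPASS'],
)
# LATIN1 maps all 256 byte values, so fetching cannot raise
# UnicodeDecodeError; non-ASCII text just arrives as mojibake.
raw.set_client_encoding('LATIN1')

df = pandas.read_sql_query(f'SELECT * FROM "{SCHEMA}"."{VIEWNAME}"', raw)

# If the bytes later turn out to be e.g. Windows-1252, a text column can
# be repaired with: col.str.encode('latin1').str.decode('cp1252')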
Per the Postgres docs (Localization / Character Set Support):
The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0–127 according to the ASCII standard, while byte values 128–255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting because PostgreSQL will be unable to help you by converting or validating non-ASCII characters.
This means the data can be in a variety of encodings once you get past ASCII 127. (Your failing byte 0xa0, for example, is an invalid start byte in UTF-8 but a plain non-breaking space in Latin-1 and Windows-1252.) Generally it turns out the data was entered as one of the Windows code pages. If you know where the data originally came from, you may be able to narrow down the choices. Then you can try setting the client_encoding to the appropriate code page to see if the export succeeds.
Going forward, it is a good idea to move the server_encoding off SQL_ASCII, to make life easier.
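A minimal sketch of that trial-and-error loop, reusing the conn_str, VIEWNAME, and SCHEMA placeholders from the question (the candidate list is an assumption; order it by where you think the data came from):

import pandas
from sqlalchemy import create_engine

# Candidate code pages to try. WIN1252 is the usual suspect for
# Windows-entered data; LATIN1 accepts any byte, so it is a last resort.
CANDIDATES = ['WIN1252', 'LATIN1']

for enc in CANDIDATES:
    # With a SQL_ASCII server no conversion happens server-side; this
    # setting just tells psycopg2 how to decode the raw bytes.
    engine = create_engine(conn_str, client_encoding=enc)
    try:
        with engine.connect() as conn:
            df = pandas.read_sql_table(VIEWNAME, conn, SCHEMA)
        print(f"Export succeeded with client_encoding={enc}")
        break
    except UnicodeDecodeError as err:
        print(f"client_encoding={enc} failed: {err}")
    finally:
        engine.dispose()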