Search code examples
pythonsql-serverunicodepyodbcpypyodbc

What's the cause of this UnicodeDecodeError with an nvarchar field using pyodbc and MSSQL?


I can read from a MSSQL database by sending queries in python through pypyodbc.

Mostly unicode characters are handled correctly, but I've hit a certain character that causes an error.

The field in question is of type nvarchar(50) and begins with this character "􀄑" which renders for me a bit like this...

-----
|100|
|111| 
-----

If that number is hex 0x100111 then it's the character supplementary private use area-b u+100111. Though interestingly, if it's binary 0b100111 then it's an apostrophe, could it be that the wrong encoding was used when the data was uploaded? This field is storing part of a Chinese postal address.

The error message includes

UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data

Here it is in full...

Traceback (most recent call last):   File "question.py", line 19, in <module>
    results.fetchone()   File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 1869, in fetchone
    value_list.append(buf_cvt_func(from_buffer_u(alloc_buffer)))   File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 482, in UCS_dec
    uchar = buffer.raw[i:i + ucs_length].decode(odbc_decoding)   File "/VIRTUAL_ENVIRONMENT_DIR/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True) UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data

Here's some minimal reproducing code...

import pypyodbc

connection_string = (
    "DSN=sqlserverdatasource;"
    "UID=REDACTED;"
    "PWD=REDACTED;"
    "DATABASE=obi_load")

connection = pypyodbc.connect(connection_string)

cursor = connection.cursor()

query_sql = (
    "SELECT address_line_1 "
    "FROM address "
    "WHERE address_id == 'REDACTED' ")

with cursor.execute(query_sql) as results:
    row = results.fetchone() # This is the line that raises the error.
    print row

Here is a chunk of my /etc/freetds/freetds.conf

[global]
;   tds version = 4.2
;   dump file = /tmp/freetds.log
;   debug flags = 0xffff
;   timeout = 10
;   connect timeout = 10
    text size = 64512

[sqlserver]
host = REDACTED
port = 1433
tds version = 7.0
client charset = UTF-8

I've also tried with client charset = UTF-16 and omitting that line all together.

Here's the relevant chunk from my /etc/odbc.ini

[sqlserverdatasource]
Driver = FreeTDS
Description = ODBC connection via FreeTDS
Trace = No
Servername = sqlserver
Database = REDACTED

Here's the relevant chunk from my /etc/odbcinst.ini

[FreeTDS]
Description = TDS Driver (Sybase/MS SQL)
Driver = /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
Setup = /usr/lib/x86_64-linux-gnu/odbc/libtdsS.so
CPTimeout =
CPReuse =
UsageCount = 1

I can work around this issue by fetching results in a try/except block, throwing away any rows that raise a UnicodeDecodeError, but is there a solution? Can I throw away just the undecodable character, or is there a way to fetch this line without raising an error?

It's not inconceivable that some bad data has ended up on the database.

I've Googled around and checked this site's related questions, but have had no luck.


Solution

  • This problem was eventually worked around, I suspect that the problem was that text had a character of one encoding hammered into a field with another declared encoding through some hacky method when the table was being set up.