Tags: python, json, python-2.7, utf-8

Python json.dumps of a tuple with some UTF-8 characters either fails or converts them; I want the encoded characters retained as is


On my server, a Python script gets data from a database as a tuple. Then the script converts the tuple to a string (using json.dumps()) to be passed to the JavaScript script in the user's browser.

The data include German names such as Weidmüller. When the Python script gets that data, it returns it as Weidm\xfcller, where \xfc is the UTF-8 encoding of ü. So far so good.

However,

  • json.dumps(tableData,ensure_ascii=False) converts the \xfc to �
  • json.dumps(tableData,ensure_ascii=True) fails: "UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 5: invalid start byte"

What I really want is for json.dumps to leave the UTF-8 encoded character alone; to just pass the \xfc as is. That way the JavaScript script in the user's browser can do the decoding. Is that possible?

Or, am I approaching the problem incorrectly?

Here is the complete code:

import MySQLdb

...


    # Open the data base and return a handle to it and its cursor
    dataBase, dbCursor = database.OpenDB()

    # Get data from the URL
    fieldStore = cgi.FieldStorage()
    selFieldName = selFieldValue = ''
    sqlQuery = 'SELECT * FROM %s' % (database.CompTableName)
    if ('fldName' in fieldStore) and ('fldValue' in fieldStore):
        fldName = fieldStore['fldName'].value
        fldValue = fieldStore['fldValue'].value
        sqlQuery += ' WHERE %s = \'%s\'' % (fldName,fldValue)
    if ('max' in fieldStore):
        maxRows = fieldStore['max'].value
        sqlQuery += ' LIMIT ' + maxRows
    # Get the selected rows from the table as a tuple of tuples
    rowsAffected = dbCursor.execute(sqlQuery)
    tableData = dbCursor.fetchall()
    # Close the database and return the results
    dataBase.close()
    
    jsonTableData = json.dumps(tableData,encoding='latin1',ensure_ascii=True)
    print jsonTableData

And here is test code:

    tableData = (('item1', 'Jones',), ('item2', 'Weidm\xfcller'))
    jsonTableData = json.dumps(tableData,encoding='latin1',ensure_ascii=True)
    print jsonTableData

Solution

  • \xfc is not the UTF-8 encoding of ü; it is the Latin-1 encoding.

    >>> u'ü'.encode('latin-1')
    '\xfc'
    >>> u'ü'.encode('utf-8')
    '\xc3\xbc'
    

    If you pass json.dumps actual text (unicode objects rather than byte strings), you shouldn't get replacement characters like that:

    >>> json.dumps({u"k": u"Weidmüller"})
    '{"k": "Weidm\\u00fcller"}'
    >>> json.dumps({u"k": u"Weidmüller"}, ensure_ascii=False)
    u'{"k": "Weidm\xfcller"}'
    

    Check to make sure that what you're getting from the database is correctly decoded text in the first place.
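
As a minimal sketch of that advice, the snippet below decodes every byte string in the rows before handing them to json.dumps. The helper name decode_rows and the 'latin-1' default are assumptions for illustration, chosen because the 0xfc byte in the question matches Latin-1; the encoding has to match whatever the database connection really returns.

```python
import json


def decode_rows(rows, encoding='latin-1'):
    """Return the rows as lists, with every byte string decoded to text.

    `encoding` is an assumption: it must match the bytes the database
    driver hands back (Latin-1 fits the 0xfc byte seen in the question).
    """
    return [
        [cell.decode(encoding) if isinstance(cell, bytes) else cell
         for cell in row]
        for row in rows
    ]


# Same shape as the test data in the question, written as byte strings.
table_data = ((b'item1', b'Jones'), (b'item2', b'Weidm\xfcller'))

# 0xfc is not a valid UTF-8 start byte, which is what the traceback reports.
try:
    b'Weidm\xfcller'.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Once the cells are text, the default ensure_ascii=True just escapes them.
json_table_data = json.dumps(decode_rows(table_data))
print(json_table_data)  # prints [["item1", "Jones"], ["item2", "Weidm\u00fcller"]]

# The \u00fc escape is plain JSON; any JSON parser restores the character.
round_trip = json.loads(json_table_data)
```

With the rows decoded, the default ensure_ascii=True emits \u00fc, which JSON.parse in the browser turns back into ü, so no extra decoding is needed on the JavaScript side. It may also be worth checking whether the driver can return unicode directly (MySQLdb's connect() takes charset and use_unicode arguments), which would make a helper like this unnecessary.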