
Python Load UTF-8 JSON


I have the following JSON (for simplicity's sake I'll only use one but there are 100 entries in reality):

{
    "Active": false, 
    "Book": "US Derivat. London, Mike Übersax/Michael Jealous", 
    "ExpirationDate": "2006-10-12", 
    "Isin": "CH0013096497", 
    "IssueDate": "2001-10-09", 
    "KbForXMonths": "0", 
    "KbPeriodDay": "Period", 
    "KbType": "Prozent", 
    "KbYear": "0.5", 
    "Keyinvest_IssueRetro": "0.50%", 
    "Keyinvest_RecurringRetro": "1.00% pro rata temporis", 
    "Keyinvest_RetroPayment": "Every month", 
    "LastImportDate": "2008-12-31", 
    "LiberierungDate": "1900-01-01", 
    "NominalCcy": "USD", 
    "NominalStueck": "5,000", 
    "PrimaryCCR": "0", 
    "QuoteType": "Nominal", 
    "RealValor": "0", 
    "Remarks": "", 
    "RwbeProductId_CCR": "034900", 
    "RwbeProductId_EFS": "034900", 
    "SecName": "Cliquet GROI on Nasdaq", 
    "SecType": "EQ", 
    "SubscriptionEndDate": "1900-01-01", 
    "TerminationDate": "2003-10-19", 
    "TradingCcy": "USD", 
    "Valor": 1309649
}

I'm trying to read this JSON in order to save it as a .csv (so that I can import it into a database).

However, when I try to write this JSON data as a CSV like so:

import codecs
import csv

with codecs.open('EFSDUMP.csv', 'w', 'utf-8-sig') as csv_file:
    content_writer = csv.writer(csv_file, delimiter=',')
    content_writer.writerow(data.values())

I get an error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xdc' in position 25: ordinal not in range(128)

That is because there's an umlaut in the JSON (see attribute "Book").
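The failing step can be reproduced in isolation (a minimal sketch; `u'\xdc'` is the Ü from "Book"):

```python
# Python 2's csv writer converts values with str(), which implies an ASCII
# encode; any non-ASCII character such as Ü (U+00DC) then raises.
try:
    u'\xdc'.encode('ascii')
except UnicodeEncodeError as exc:
    print('failed as expected:', exc)
```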

I try to read the JSON like this:

import json

data = json.loads(open('EFSDUMP.json').read().decode('utf-8-sig'))

What's interesting is that this:

print data

Gives me this:

{u'PrimaryCCR': u'0', u'SecType': u'EQ', u'Valor': 1309649, u'KbType': u'Prozent', u'Book': u'US Derivat. London, Mike \xdcbersax/Michael Jealous', u'Keyinvest_RecurringRetro': u'1.00% pro rata temporis', u'TerminationDate': u'2003-10-19', u'RwbeProductId_CCR': u'034900', u'SubscriptionEndDate': u'1900-01-01', u'ExpirationDate': u'2006-10-12', u'Keyinvest_RetroPayment': u'Every month', u'Keyinvest_IssueRetro': u'0.50%', u'QuoteType': u'Nominal', u'KbYear': u'0.5', u'LastImportDate': u'2008-12-31', u'Remarks': u'', u'RealValor': u'0', u'SecName': u'Cliquet GROI on Nasdaq', u'Active': False, u'KbPeriodDay': u'Period', u'Isin': u'CH0013096497', u'LiberierungDate': u'1900-01-01', u'IssueDate': u'2001-10-09', u'KbForXMonths': u'0', u'NominalCcy': u'USD', u'RwbeProductId_EFS': u'034900', u'TradingCcy': u'USD', u'NominalStueck': u'5,000'}

Clearly, the umlaut became a '\xdc'.

However when I do this:

print data['Book']

That is, when I access the attribute directly, I get:

US Derivat. London, Mike Übersax/Michael Jealous

So the umlaut is an actual umlaut again.
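(The escaping comes from `repr()`: printing a dict prints the `repr()` of each value, and Python 2's `repr()` escapes non-ASCII characters in unicode strings, while printing the string itself writes the actual characters. A sketch, using Python 3's `ascii()`, which produces the same escaped form:)

```python
s = u'\xdcbersax'  # u'Übersax' -- the value as it sits inside the dict

# Containers print the repr() of their elements; Python 2's repr() of a
# unicode string escapes non-ASCII. Python 3's ascii() gives the same form:
print(ascii(s))    # '\xdcbersax'
print(s)           # Übersax
```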

I'm pretty sure that the JSON is UTF-8 without BOM (Notepad++ claims so).

I have already tried all of the suggestions from Python load json file with UTF-8 BOM header, without any success.

How can I properly read the UTF-8 JSON file in order to be able to write it as .csv?

Any help is greatly appreciated.

Python version: 2.7.2


Solution

  • In Python 2, the csv module does not support writing Unicode. You need to encode it manually here, as otherwise your Unicode values are encoded for you using ASCII (which is why you got the encoding exception).

    This also means you need to write the UTF-8 BOM manually, but only if you really need it. UTF-8 can only be written one way; a byte order mark is not needed to read UTF-8 files. Microsoft likes to add it to files to make encoding detection easier for their tools, but the UTF-8 BOM can actually make it harder for other tools to work correctly, as they won't ignore the extra initial character.

    Use:

    import codecs
    import csv

    with open('EFSDUMP.csv', 'wb') as csv_file:
        # Write the BOM by hand, then feed the csv module UTF-8-encoded bytes.
        csv_file.write(codecs.BOM_UTF8)
        content_writer = csv.writer(csv_file)
        content_writer.writerow([unicode(v).encode('utf8') for v in data.values()])
    

    Note that this'll write your values in arbitrary (dictionary) order. The unicode() call will convert non-string types to unicode strings first before encoding.

    To be explicit: you've loaded the JSON data just fine. It is the CSV writing that failed for you.
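For completeness, a sketch of the same write on Python 3 (where the csv module handles Unicode natively, so the manual encoding disappears), with the keys sorted so the column order is deterministic. The two-field `data` dict is a stand-in for the real loaded JSON:

```python
import csv

# Stand-in for the loaded JSON dict (two fields are enough to demonstrate).
data = {u'Book': u'US Derivat. London, Mike \xdcbersax/Michael Jealous',
        u'Valor': 1309649}

keys = sorted(data)  # deterministic column order instead of dict order
# 'utf-8-sig' writes the BOM for you; newline='' is what the csv docs require.
with open('EFSDUMP.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(keys)                     # header row
    writer.writerow([data[k] for k in keys])  # values in matching order
```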