Python DictWriter writing UTF-8 encoded CSV files


  1. I have a list of dictionaries containing unicode strings.
  2. csv.DictWriter can write a list of dictionaries into a CSV file.
  3. I want the CSV file to be encoded in UTF8.
  4. The csv module cannot handle converting unicode strings into UTF8.
  5. The csv module documentation has an example for converting everything to UTF8:

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')
    
  6. It also has a UnicodeWriter class.

But... how do I make DictWriter work with these? Wouldn't they have to inject themselves in the middle of it, to catch the disassembled dictionaries and encode them before it writes them to the file? I don't get it.


Solution

  • UPDATE: The third-party unicodecsv module implements this 7-year-old answer for you; an example follows the original code. There's also a Python 3 solution that doesn't require a third-party module.

    Original Python 2 Answer

    If using Python 2.7 or later (but not Python 3), use a dict comprehension to encode the dictionary's values as UTF-8 before passing it to DictWriter:

    # coding: utf-8
    import csv
    
    D = {'name': u'马克', 'pinyin': u'mǎkè'}
    
    f = open('out.csv', 'wb')
    f.write(u'\ufeff'.encode('utf8'))  # BOM (optional...Excel needs it to open UTF-8 file properly)
    w = csv.DictWriter(f, sorted(D.keys()))
    w.writeheader()
    w.writerow({k:v.encode('utf8') for k, v in D.items()})
    f.close()
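
The comprehension above calls .encode on every value, so it raises AttributeError if a row holds a non-text value such as an int or None. A minimal sketch of a more tolerant helper follows. The names encode_row and text_type are hypothetical, not part of the csv module, and the version check is only there so the same snippet also runs on Python 3:

```python
# coding: utf-8
import sys

# On Python 2 the text type is `unicode`; on Python 3 it is `str`.
# The conditional keeps the name `unicode` from being evaluated on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def encode_row(row, encoding='utf-8'):
    """Return a copy of row with text values encoded; other values are left as-is."""
    return {k: (v.encode(encoding) if isinstance(v, text_type) else v)
            for k, v in row.items()}


# Hypothetical sample row mixing text with non-text values.
sample = encode_row({'name': u'马克', 'count': 3, 'missing': None})
```

On Python 2 this would be used as w.writerow(encode_row(D)) in place of the inline comprehension.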
    

    You can use the same idea to adapt the UnicodeWriter class from the csv documentation into a DictUnicodeWriter:

    # coding: utf-8
    import csv
    import cStringIO
    import codecs
    
    class DictUnicodeWriter(object):
    
        def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
            # Redirect output to a queue
            self.queue = cStringIO.StringIO()
            self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
            self.stream = f
            self.encoder = codecs.getincrementalencoder(encoding)()
    
        def writerow(self, D):
            self.writer.writerow({k:v.encode("utf-8") for k, v in D.items()})
            # Fetch UTF-8 output from the queue ...
            data = self.queue.getvalue()
            data = data.decode("utf-8")
            # ... and reencode it into the target encoding
            data = self.encoder.encode(data)
            # write to the target stream
            self.stream.write(data)
            # empty queue
            self.queue.truncate(0)
    
        def writerows(self, rows):
            for D in rows:
                self.writerow(D)
    
        def writeheader(self):
            self.writer.writeheader()
    
    D1 = {'name': u'马克', 'pinyin': u'Mǎkè'}
    D2 = {'name': u'美国', 'pinyin': u'Měiguó'}
    f = open('out.csv', 'wb')
    f.write(u'\ufeff'.encode('utf8'))  # BOM (optional...Excel needs it to open UTF-8 file properly)
    w = DictUnicodeWriter(f, sorted(D1.keys()))  # D1 and D2 share the same keys
    w.writeheader()
    w.writerows([D1, D2])
    f.close()
    

    Python 2 unicodecsv Example:

    # coding: utf-8
    import unicodecsv as csv
    
    D = {u'name': u'马克', u'pinyin': u'mǎkè'}
    
    with open('out.csv','wb') as f:
        w = csv.DictWriter(f, fieldnames=sorted(D.keys()), encoding='utf-8-sig')
        w.writeheader()
        w.writerow(D)
    

    Python 3:

    Additionally, Python 3's built-in csv module supports Unicode natively:

    import csv
    
    D = {'name': '马克', 'pinyin': 'mǎkè'}
    
    # Use 'w' and newline='' instead of 'wb' in Python 3.
    # Use 'utf-8-sig' for UTF-8 w/ BOM for Excel to read as UTF-8 properly.
    # Use 'utf8' for UTF-8 (no BOM) otherwise.
    with open('out.csv', 'w', encoding='utf-8-sig', newline='') as f: 
        w = csv.DictWriter(f, fieldnames=sorted(D))
        w.writeheader()
        w.writerow(D)
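
To check that the file really starts with a BOM and that the text survives a round trip, something along these lines could be used. This is only a sketch: the temporary path and the names rows, has_bom and read_back are illustrative assumptions. Reading with 'utf-8-sig' strips the BOM, so the first header name does not come back prefixed with it.

```python
import csv
import os
import tempfile

rows = [
    {'name': '马克', 'pinyin': 'Mǎkè'},
    {'name': '美国', 'pinyin': 'Měiguó'},
]
fieldnames = sorted(rows[0])

# Write to a throwaway directory so the example leaves no file behind in the cwd.
path = os.path.join(tempfile.mkdtemp(), 'out.csv')

with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    w = csv.DictWriter(f, fieldnames=fieldnames)
    w.writeheader()
    w.writerows(rows)

# Inspect the raw bytes: 'utf-8-sig' should have written the three-byte BOM first.
with open(path, 'rb') as f:
    has_bom = f.read().startswith(b'\xef\xbb\xbf')

# Read back with the same codec so the BOM is removed before parsing.
with open(path, 'r', encoding='utf-8-sig', newline='') as f:
    read_back = [dict(r) for r in csv.DictReader(f)]
```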