
Using Gzip in Python3 with Csv


The goal is to write code that runs on both Python 2.7 and Python 3.6+.

This code currently works on Python 2.7. It creates a GzipFile object and later writes lists to the gzip file. Finally, it uploads the gzip file to an S3 bucket.

Starting with:

data = [ [1, 2, 3], [4, 5, 6], ["a", 3, "iamastring"] ]

I tried:

@contextlib.contextmanager  # needed so callers can use `with get_gzip_writer(...)`
def get_gzip_writer(path):
  with s3_reader.open(path) as s3_file:
    with gzip.GzipFile(fileobj=s3_file, mode="w") as gzip_file:
      yield csv.writer(gzip_file)

However, this code does not work on Python 3, because csv.writer emits str whereas GzipFile expects bytes. The gzip file has to stay in binary mode because of how it is used and read later on, which means io.TextIOWrapper does not work in this specific use case.
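
A minimal reproduction of that mismatch, as a sketch for Python 3 only. The in-memory `io.BytesIO` buffer is an assumption standing in for the S3 file object, which is not shown here:

```python
import csv
import gzip
import io

buffer = io.BytesIO()  # assumption: in-memory stand-in for the S3 file object

with gzip.GzipFile(fileobj=buffer, mode="w") as gzip_file:
    try:
        # On Python 3, csv.writer hands str to write(), but GzipFile wants bytes.
        csv.writer(gzip_file).writerow([1, 2, 3])
        error = None
    except TypeError as exc:
        error = exc

print(type(error).__name__, error)
```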

I have tried to create an adapter class.

class BytesToBytes(object):
  def __init__(self, stream, dialect=csv.excel, encoding='utf-8', **kwargs):
    self.temp = six.StringIO()
    self.writer = csv.writer(self.temp, dialect, **kwargs)
    self.stream = stream
    self.encoding = encoding

  def writerow(self, row):
    # Decode any bytes values so that csv only ever sees text.
    self.writer.writerow([s.decode('utf-8') if hasattr(s, 'decode') else s for s in row])
    # Encode the formatted line and pass it on to the binary stream.
    self.stream.write(six.ensure_binary(self.temp.getvalue(), self.encoding))
    # Reset the temporary buffer for the next row.
    self.temp.seek(0)
    self.temp.truncate(0)

With the updated code looking like:

@contextlib.contextmanager  # needed so callers can use `with get_gzip_writer(...)`
def get_gzip_writer(path):
  with s3_reader.open(path) as s3_file:
    with gzip.GzipFile(fileobj=s3_file, mode="w") as gzip_file:
      yield BytesToBytes(gzip_file)

This works, but a full class seems excessive for this single use case.

This is the code that calls the above:

def write_data(data, url):
  with get_gzip_writer(url) as writer:
    for row in data:
      writer.writerow(row)
  return url

What options are available for working with GzipFile (while maintaining bytes for read/write) without creating an entire adapter class?


Solution

  • I've read and considered your concern about keeping the gzip file in binary mode, and I think you can still use TextIOWrapper. My understanding is that its job is to provide an interface for writing bytes from text. The documentation describes it as:

    A buffered text stream providing higher-level access to a BufferedIOBase buffered binary stream.

    I interpret that as "text in, bytes out", which is what your gzip application needs, right? If so, then for Python 3 we need to give the CSV writer something that accepts strings but ultimately writes bytes.

    Enter TextIOWrapper with a UTF-8 encoding: it accepts strings from csv.writer's writerow() and writerows() methods and writes UTF-8-encoded bytes to gzip_file.

    I've run this in Python 2 and 3, unzipped the file, and it looks good:

    import csv, gzip, io, six

    def get_gzip_writer(path):
        with open(path, 'wb') as s3_file:
            with gzip.GzipFile(fileobj=s3_file, mode='wb') as gzip_file:
                if six.PY3:
                    # Text in, UTF-8 bytes out; newline='' leaves csv's own
                    # line endings untouched.
                    with io.TextIOWrapper(gzip_file, encoding='utf-8', newline='') as wrapper:
                        yield csv.writer(wrapper)
                elif six.PY2:
                    # Python 2's csv module already writes byte strings.
                    yield csv.writer(gzip_file)
                else:
                    raise ValueError('Neither Python 2 nor 3?!')


    data = [[1,2,3],['a','b','c']]
    url = 'output.gz'

    # The generator yields exactly one writer, so this loop body runs once.
    for writer in get_gzip_writer(url):
        for row in data:
            writer.writerow(row)
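
To check the "text in, bytes out" reading against the requirement that the gzip content stays bytes, here is a minimal Python 3 sketch of a round trip. The helper name `gzip_csv_writer` and the in-memory `io.BytesIO` buffer (standing in for the S3 file object) are assumptions made for illustration and are not part of the original code. The helper is written as a context manager so that it matches the `with ... as writer` call in `write_data`.

```python
import contextlib
import csv
import gzip
import io


@contextlib.contextmanager
def gzip_csv_writer(binary_file, encoding="utf-8"):
    """Yield a csv.writer whose rows end up gzip-compressed in binary_file."""
    with gzip.GzipFile(fileobj=binary_file, mode="wb") as gzip_file:
        # newline="" keeps TextIOWrapper from translating csv's "\r\n" endings.
        wrapper = io.TextIOWrapper(gzip_file, encoding=encoding, newline="")
        yield csv.writer(wrapper)
        # Push buffered text into the gzip stream, then release it so that
        # only the surrounding `with` block closes gzip_file.
        wrapper.flush()
        wrapper.detach()


data = [[1, 2, 3], [4, 5, 6], ["a", 3, "iamastring"]]

buffer = io.BytesIO()  # assumption: in-memory stand-in for the S3 file object
with gzip_csv_writer(buffer) as writer:
    for row in data:
        writer.writerow(row)

# Decompressing gives back bytes, not str.
raw = gzip.decompress(buffer.getvalue())
print(raw)
```

Only the writing side goes through the text layer. Anything that later opens the gzip data in binary mode still receives bytes, encoded as UTF-8.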