Tags: python, dask, dask-distributed

Read single large zipped csv (too large for memory) using Dask


I have a use case where an S3 bucket contains a few hundred gzipped files. Each individual file, when unzipped and loaded into a dataframe, occupies more than the available memory. I'd like to read these files and apply some functions to them, but I can't figure out how to make that work. I've tried Dask's read_table, but since compression='gzip' seems to require blocksize=None, that doesn't work either. Is there something I'm missing, or is there a better approach?

Here's how I'm trying to call read_table (note that names, types, and nulls are defined elsewhere):

import csv
import dask.dataframe as dd

df = dd.read_table(uri,
        compression='gzip',
        blocksize=None,
        names=names,
        dtype=types,
        na_values=nulls,
        encoding='utf-8',
        quoting=csv.QUOTE_NONE,
        error_bad_lines=False)

I'm open to any suggestions, including changing the entire approach.


Solution

  • This cannot be done directly, because the only way to know where you are within a gzip stream is to read it from the very beginning. That means every partition of data would need to read the whole file from the start, so Dask explicitly raises an error instead.

    It might, in theory, be possible to scan the file once and figure out where all the gzip blocks start. However, as far as I know, no code exists to do this; a rough workaround is sketched below.
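
    One workaround, as a rough sketch rather than a drop-in solution: stream each gzipped file with pandas in fixed-size chunks (pandas reads gzip sequentially, so memory stays bounded), write the chunks out to a splittable format such as Parquet, and then point Dask at the result. The bucket paths, file listing, and chunk size below are made up for illustration; names, types, and nulls are assumed to be the same objects defined elsewhere in the question.

    import csv

    import dask.dataframe as dd
    import pandas as pd

    source_uris = ["s3://my-bucket/data/part-000.tsv.gz"]  # hypothetical listing of the gzipped files
    out_prefix = "s3://my-bucket/parquet/"                  # hypothetical destination for the re-chunked data

    for file_index, uri in enumerate(source_uris):
        # chunksize makes read_csv return an iterator of DataFrames of roughly
        # that many rows each, so a single oversized file never has to fit in
        # memory at once. sep="\t" mirrors read_table's tab delimiter.
        chunks = pd.read_csv(
            uri,
            sep="\t",
            compression="gzip",
            names=names,
            dtype=types,
            na_values=nulls,
            encoding="utf-8",
            quoting=csv.QUOTE_NONE,
            chunksize=1_000_000,
        )
        for chunk_index, chunk in enumerate(chunks):
            # Each chunk becomes its own Parquet part file.
            chunk.to_parquet(
                f"{out_prefix}file{file_index:04d}-chunk{chunk_index:05d}.parquet"
            )

    # Dask can now assign one partition per Parquet part and load lazily,
    # without ever having to seek inside a gzip stream.
    df = dd.read_parquet(out_prefix)

    Because the Parquet parts are independent files, no partition ever has to re-read a stream from the beginning, which is exactly the property the gzipped originals lack.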