
Reading multiple CSVs from a zip file using dask


I have been trying to read multiple CSVs from a zipped directory using dask, according to this answer. However, I get a long error message which I cannot make sense of. I think the important line is this one:

msgpack.exceptions.ExtraData: unpack(b) received extra data.

The data is publicly available.

import numpy as np
import pandas as pd
import dask.dataframe as dd

# read data, the dask way
df = dd.read_csv('zip://BACI*.csv', sep=",", dtype={"k":str, "i":int, "j":int, "t":int}, storage_options={'fo': '../input/baci_hs92.zip'})
df.head()

I believe this kind of fly-by extraction should work in dask and I would rather not extract all files into some directory as other answers have suggested.


Solution

  • The following is equivalent and works:

    In [1]: u = "http://www.cepii.fr/DATA_DOWNLOAD/baci/data/BACI_HS92_V202301.zip"
    
    In [2]: import numpy as np
       ...: import pandas as pd
       ...: import dask.dataframe as dd
       ...:
       ...: # read data, the dask way
       ...: df = dd.read_csv(f'zip://BACI*.csv::{u}', sep=",", dtype={"k":str, "i":int, "j":int, "t":int})
       ...: df.head()
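
    As a quick sanity check (not part of the original answer), you can list which member files the chained zip://...::http URL will match by instantiating the zip filesystem directly; fsspec's ZipFileSystem accepts a URL for its fo argument:

        import fsspec

        u = "http://www.cepii.fr/DATA_DOWNLOAD/baci/data/BACI_HS92_V202301.zip"

        # Open the remote archive lazily and expand the same glob that
        # dd.read_csv will use, to confirm the expected CSV members.
        fs = fsspec.filesystem("zip", fo=u)
        print(fs.glob("BACI*.csv"))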
    

    This is slow, however, since dask preemptively reads blocks of the compressed data to find newline offsets, and each block must be decompressed from the start of its member file because DEFLATE streams do not support random access.

    If you instead add blocksize=None, the upfront cost is much smaller, since there is no need to find newlines; however, even getting the .head() requires reading the whole of the first compressed file. In addition, it raises a dtype mismatch for column "q", presumably because the rows dask samples to infer dtypes are all numeric, while later rows in the same column contain strings.
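
    A minimal sketch of that variant, assuming (a guess, not confirmed by the answer) that declaring "q" as a string sidesteps the inferred-dtype mismatch:

        import dask.dataframe as dd

        u = "http://www.cepii.fr/DATA_DOWNLOAD/baci/data/BACI_HS92_V202301.zip"

        # blocksize=None gives one partition per member file, so dask does
        # not scan the compressed streams up front for newline offsets.
        # Declaring "q" as str avoids the dtype mismatch at compute time.
        df = dd.read_csv(
            f"zip://BACI*.csv::{u}",
            sep=",",
            blocksize=None,
            dtype={"k": str, "i": int, "j": int, "t": int, "q": str},
        )
        df.head()  # still decompresses the whole first member file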

    The kerchunk project is interested in both finding and indexing newlines in CSVs ( https://github.com/fsspec/kerchunk/issues/66 ) and indexing ZIP/gzip files ( https://github.com/fsspec/kerchunk/issues/281 ), which would mean fast parallel access to data like this once someone has done the upfront indexing work. This functionality does not yet exist.