Search code examples
daskfastparquet

How to read multiple parquet files (with same schema) from multiple directories with dask/fastparquet


I need to use dask to load multiple parquet files with identical schema into a single dataframe. This works when they are all in the same directory, but not when they're in separate directories.

For example:

import fastparquet
pfile = fastparquet.ParquetFile(['data/data1.parq', 'data/data2.parq'])

works just fine, but if I copy data2.parq to a different directory, the following does not work:

pfile = fastparquet.ParquetFile(['data/data1.parq', 'data2/data2.parq'])

The traceback I get is the following:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-b3d381f14edc> in <module>()
----> 1 pfile = fastparquet.ParquetFile(['data/data1.parq', 'data2/data2.parq'])

~/anaconda/envs/hv/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, sep)
     82         if isinstance(fn, (tuple, list)):
     83             basepath, fmd = metadata_from_many(fn, verify_schema=verify,
---> 84                                                open_with=open_with)
     85             self.fn = sep.join([basepath, '_metadata'])  # effective file
     86             self.fmd = fmd

~/anaconda/envs/hv/lib/python3.6/site-packages/fastparquet/util.py in metadata_from_many(file_list, verify_schema, open_with)
    164     else:
    165         raise ValueError("Merge requires all PaquetFile instances or none")
--> 166     basepath, file_list = analyse_paths(file_list, sep)
    167 
    168     if verify_schema:

~/anaconda/envs/hv/lib/python3.6/site-packages/fastparquet/util.py in analyse_paths(file_list, sep)
    221     if len({tuple([p.split('=')[0] for p in parts[l:-1]])
    222             for parts in path_parts_list}) > 1:
--> 223         raise ValueError('Partitioning directories do not agree')
    224     for path_parts in path_parts_list:
    225         for path_part in path_parts[l:-1]:

ValueError: Partitioning directories do not agree

I get the same error when using dask.dataframe.read_parquet, which I assume uses the same ParquetFile object.

How can I load multiple files from different directories? Putting all the files I need to load into the same directory is not an option.


Solution

  • This does work in fastparquet on master, if using either absolute paths or explicit relative paths:

    pfile = fastparquet.ParquetFile(['./data/data1.parq', './data2/data2.parq'])
    

    The need for the leading ./ should be considered a bug - see the issue.