Search code examples
pythonpandasdaskfeather

Load many feather files in a folder into dask


With a folder with many .feather files, I would like to load all of them into dask in python.

So far, I have tried the following sourced from a similar question on GitHub https://github.com/dask/dask/issues/1277

files = [...]
dfs = [dask.delayed(feather.read_dataframe)(f) for f in files]
df = dd.concat(dfs)

Unfortunately, this gives me the error TypeError: Truth of Delayed objects is not supported which is mentioned there, but a workaround is not clear.

Is it possible to do the above in dask?


Solution

  • Instead of concat, which operates on dataframes, you want to use from_delayed, which turns a list of delayed objects, each of which represents a dataframe, into a single logical dataframe

    dfs = [dask.delayed(feather.read_dataframe)(f) for f in files]
    df = dd.from_delayed(dfs)
    

    If possible, you should also supply the meta= (a zero-length dataframe, describing the columns, index and dtypes) and divisions= (the boundary values of the index along the partitions) kwargs.