python, hdfs, dask, ceph, distributed-filesystem

Distributed file systems supported by Python/Dask


Which distributed file systems are supported by Dask? Specifically, from which file systems can one read a dask.dataframe? From the Dask documentation I can see that HDFS is certainly supported. Are any other distributed file systems supported, e.g. Ceph?

I found some discussion about supporting other file systems here: https://github.com/dask/distributed/issues/33, but no final conclusion, except that HDFS is "nastier" than the other options.

Thank you for your help!


Solution

  • The simplest answer is that if you can mount the filesystem onto every node, so that it can be accessed as a local filesystem, then you can use any distributed file system; you just get no performance optimisation for the original location of any given file chunk (see the first sketch below).

    In cases where data location is available from a metadata service (which would be true for Ceph), you could restrict loading tasks to run only on the machines where the data is resident. This is not implemented, but it might not be too complicated to do from the user side (see the second sketch below). A similar thing was done in the past for HDFS, but we found that the optimisation did not justify the extra complexity of the code.
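
    For example, with a Ceph (or NFS, GlusterFS, ...) volume mounted at the same path on every worker, reading is no different from the local case. A minimal sketch, in which the mount point /mnt/cephfs and the file layout are hypothetical:

        import dask.dataframe as dd

        # The distributed filesystem is mounted at the same path on every
        # worker node, so Dask treats it like an ordinary local filesystem.
        df = dd.read_csv('/mnt/cephfs/data/2016-*.csv')

        # Tasks run on whichever workers the scheduler chooses; the
        # filesystem itself fetches remote chunks over the network, with
        # no Dask-level locality optimisation.
        print(df.head())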
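
    If you wanted to experiment with the locality idea from the user side, the workers= restriction that dask.distributed already provides would be one way to pin loading tasks. A sketch, in which the scheduler address, worker host names, file paths and the metadata lookup (hosts_for) are all hypothetical stand-ins:

        import pandas as pd
        import dask.dataframe as dd
        from dask.distributed import Client

        client = Client('scheduler-address:8786')

        paths = ['/mnt/cephfs/data/part-0.csv',
                 '/mnt/cephfs/data/part-1.csv']

        def hosts_for(path):
            # Stand-in for a metadata-service query (e.g. asking Ceph
            # which hosts hold a file's objects); fixed answers here.
            return {'/mnt/cephfs/data/part-0.csv': ['node-1'],
                    '/mnt/cephfs/data/part-1.csv': ['node-2']}[path]

        # Pin each loading task to the machines holding its data, then
        # assemble the resulting futures into a dask.dataframe.
        futures = [client.submit(pd.read_csv, p, workers=hosts_for(p))
                   for p in paths]
        df = dd.from_delayed(futures)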