Tags: pandas, http, dask, parquet, fastparquet

Dask dataframe read_parquet fails when reading over HTTP


I have been dealing with this problem for a week. I use the command:

from dask import dataframe as ddf
ddf.read_parquet("http://IP:port/webhdfs/v1/user/...")

This fails with an "invalid parquet magic" error. However, ddf.read_parquet works fine with "webhdfs://" URLs.

I would like ddf.read_parquet to work over HTTP because I want to use it in a dask-ssh cluster whose workers have no HDFS access.


Solution

  • Although the comments already partly answer this question, I thought I would add some information as an answer.

    • HTTP(S) is supported by Dask (actually fsspec) as a backend filesystem; but to get partitioning within a file, you need to know the size of that file, and to resolve globs, you need to be able to get a list of links, neither of which is necessarily provided by any given server (see the plain-HTTP sketch after this list)
    • webHDFS (or indeed httpFS) doesn't work like a plain HTTP download: you need to call a specific API that opens the file and redirects you to a final URL on a cluster member, and only that URL serves the raw bytes. A plain GET of the /webhdfs/v1/... path returns an API response (JSON or a redirect) rather than the file contents, which is why the parquet magic-bytes check fails; the two methods are not interchangeable (see the webHDFS sketch below)
    • webHDFS is normally intended for use outside of the Hadoop cluster; within the cluster, you would probably use plain HDFS ("hdfs://"; see the last sketch below). However, kerberos-secured webHDFS can be tricky to work with, depending on how the security was set up.
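
For illustration, a minimal sketch of the plain-HTTP route. This assumes a hypothetical server (the URL is a placeholder) that serves the raw parquet bytes and reports a Content-Length header, which fsspec needs for ranged reads within the file:

from dask import dataframe as ddf

# Hypothetical URL: the server must return the raw .parquet bytes and
# a Content-Length header so fsspec can seek within the file. Globs
# such as "*.parquet" only resolve if the server exposes a link listing.
df = ddf.read_parquet("https://example.com/data/part-0000.parquet")
print(df.head())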
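The webHDFS route, which the question already found to work, would look like the sketch below; the host, port, path, and user are placeholders for your cluster, and connection arguments such as the user name are passed through storage_options:

from dask import dataframe as ddf

# Placeholders: the namenode host, the webHDFS HTTP port (often 9870
# on Hadoop 3.x, 50070 on 2.x), and the HDFS path/user are all
# cluster-specific.
df = ddf.read_parquet(
    "webhdfs://namenode:9870/user/me/data.parquet",
    storage_options={"user": "me"},
)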
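Finally, within the cluster, the plain-HDFS equivalent would be along these lines; the host and RPC port are placeholders, and every worker needs an HDFS client (pyarrow) installed:

from dask import dataframe as ddf

# Placeholders: namenode host and HDFS RPC port (often 8020).
# Requires pyarrow with its HDFS bindings available on every worker.
df = ddf.read_parquet("hdfs://namenode:8020/user/me/data.parquet")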