Tags: hdfs, parquet, pyarrow

Read a Parquet file from HDFS using PyArrow


I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect().

I also know I can read a Parquet file using pyarrow.parquet's read_table().

However, read_table() accepts a file path, whereas hdfs.connect() gives me a HadoopFileSystem instance.

Is it possible to use just pyarrow (with libhdfs3 installed) to read a Parquet file or folder residing in an HDFS cluster? My goal is to call to_pydict() on the resulting table so that I can pass the data along.


Solution

  • Try

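    import pyarrow as pa
    fs = pa.hdfs.connect(...)
    # Returns a pyarrow.Table; call .to_pydict() on it for a dict of columns
    table = fs.read_parquet('/path/to/hdfs-file', **other_options)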
    

    or

    import pyarrow.parquet as pq
    # read_table also accepts an open file-like object, not just a path
    with fs.open(path) as f:
        table = pq.read_table(f, **read_options)
    

    I opened https://issues.apache.org/jira/browse/ARROW-1848 to request more explicit documentation of this.