Tags: python, hdfs, pyarrow

PyArrow: read an HDFS file from a remote host


I followed the tutorial and guide from the PyArrow docs, but I still can't use the HDFS filesystem correctly to get a file from my remote host.

Pre-requisite: https://arrow.apache.org/docs/11.0/python/filesystems.html#filesystem-hdfs
Example of getting a file: https://arrow.apache.org/docs/11.0/python/generated/pyarrow.fs.HadoopFileSystem.html#pyarrow.fs.HadoopFileSystem.open_input_file

Here is some of my code:

import os
import pyarrow
from pyarrow import fs

os.environ['ARROW_LIBHDFS_DIR'] = ""  # directory containing libhdfs.so
os.environ["HADOOP_HOME"] = ""        # Hadoop installation root
os.environ["JAVA_HOME"] = ""          # JDK/JVM installation root
os.environ["CLASSPATH"] = ""          # classpath with the Hadoop jars

hdfs_config = {
    "host": "myhost",
    "port": 9443,
    "user": "me"
}

hdfs = fs.HadoopFileSystem(
    #hdfs_config['host'],
    "default",  # use fs.defaultFS from core-site.xml
    hdfs_config['port'],
    user=hdfs_config['user']
)

This part creates the pyarrow.fs.HadoopFileSystem with either the remote host I enter or the one from my core-site.xml file.
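As an aside, the same connection can also be expressed as a URI with HadoopFileSystem.from_uri; this is just a sketch using the placeholder host, port and user from above:

from pyarrow import fs

# Equivalent connection built from a URI (placeholder host/port/user)
hdfs = fs.HadoopFileSystem.from_uri("hdfs://myhost:9443/?user=me")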

But then I have a problem with my remote file at a path like "/my_group/data/red_wine_quality.csv":

data = "/my_group/data/red_wine_quality.csv"

with hdfs.open_input_file(data) as f:
    print(f.readall())

I get a "file not found" error, which is explicit enough, but why is it searching my local filesystem instead of the one on HDFS?

If data points to a local file it works, but that's not what I want.
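To check where the path actually resolves, the file info can be queried directly with get_file_info; a quick sanity check along these lines (same path as above):

# If the connection really targets HDFS, this reports the remote file;
# fs.FileType.NotFound means the path was not found on that filesystem.
info = hdfs.get_file_info("/my_group/data/red_wine_quality.csv")
print(info.type, info.path)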

I tried many examples and hosts, but none succeeded in reading the HDFS files.


Solution

  • Okay, after many tries I found why it didn't work.

    First, os.environ["HADOOP_HOME"] is useless and wasn't taken into account, no matter how I changed its value or the core-site.xml.

    Second, the actual problem was that the classpath wasn't correctly initialized.

    Following the docs, getting the classpath from the hadoop CLI and exporting it:

    import subprocess

    # Ask the hadoop CLI for the expanded classpath and export it
    output = subprocess.run(
        ["hadoop", "classpath", "--glob"],
        capture_output=True,
        text=True
    )

    os.environ["CLASSPATH"] = output.stdout
    

    did the trick.
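
    Putting it together, a minimal sketch combining the two fixes (host, port, user and path are the placeholder values from the question; ARROW_LIBHDFS_DIR and JAVA_HOME are still set beforehand as above):

    import os
    import subprocess
    from pyarrow import fs

    # Export the expanded Hadoop classpath so libhdfs can find the jars
    output = subprocess.run(
        ["hadoop", "classpath", "--glob"],
        capture_output=True,
        text=True
    )
    os.environ["CLASSPATH"] = output.stdout

    # Connect using fs.defaultFS from core-site.xml and read the remote file
    hdfs = fs.HadoopFileSystem("default", 9443, user="me")
    with hdfs.open_input_file("/my_group/data/red_wine_quality.csv") as f:
        print(f.readall())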