python hadoop parquet

Get a Parquet file from HDFS with Python


I set up my own single-node HDFS cluster on Windows, following this link.

I have already uploaded my Parquet files to it, but I can't read them from another computer.

Here's my Python code:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

hdfs_path = "hdfs://10.35.105.35:9820/tampo/oee_tampo.parquet"

fs = pa.hdfs.connect()
table = pq.read_table(hdfs_path, filesystem=fs)
df = table.to_pandas()

fs.close()

This is the error I get:

   1522     # pipe will not close when the child process exits and the
   1523     # ReadFile will hang.
   1524     self._close_pipe_fds(p2cread, p2cwrite,
   1525                          c2pread, c2pwrite,
   1526                          errread, errwrite)

FileNotFoundError: [WinError 2] The system cannot find the file specified

Can anyone help me fix this, or is there another way to read the Parquet file from HDFS?


Solution

  • Have you tried pandas' read_parquet()?

    import pandas as pd

    df = pd.read_parquet('hdfs://10.35.105.35:9820/tampo/oee_tampo.parquet')
    df
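
A note on the original error, as a guess from the traceback rather than something confirmed in the thread: it ends inside `subprocess`, which suggests pyarrow tried to launch the `hadoop` command on the client to build the Java classpath and could not find it. As far as I know, `pd.read_parquet()` with an `hdfs://` URL relies on the same pyarrow HDFS driver, so the client machine may still need Java and a Hadoop installation with `JAVA_HOME`, `HADOOP_HOME`, and `CLASSPATH` set. Below is a minimal sketch along those lines using `pyarrow.fs.HadoopFileSystem`; the helper names `missing_hdfs_env`, `split_hdfs_uri`, and `read_parquet_from_hdfs` are made up for illustration.

```python
import os
from urllib.parse import urlparse

# Environment variables the pyarrow HDFS driver is assumed to need on the client.
REQUIRED_ENV = ("JAVA_HOME", "HADOOP_HOME", "CLASSPATH")


def missing_hdfs_env(env=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if not env.get(name)]


def split_hdfs_uri(uri):
    """Split 'hdfs://host:port/path' into (host, port, path) (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"not an hdfs:// URI: {uri!r}")
    return parsed.hostname, parsed.port, parsed.path


def read_parquet_from_hdfs(uri):
    """Read a Parquet file from HDFS into a pandas DataFrame (hypothetical helper)."""
    missing = missing_hdfs_env()
    if missing:
        raise RuntimeError("HDFS client setup looks incomplete: " + ", ".join(missing))

    # Imported here so the checks above still run where pyarrow is not installed.
    from pyarrow import fs
    import pyarrow.parquet as pq

    host, port, path = split_hdfs_uri(uri)
    hdfs = fs.HadoopFileSystem(host=host, port=port)
    # With an explicit filesystem, pass only the path part, not the full URI.
    return pq.read_table(path, filesystem=hdfs).to_pandas()
```

Calling `read_parquet_from_hdfs("hdfs://10.35.105.35:9820/tampo/oee_tampo.parquet")` should then return a DataFrame, provided the client can reach both the NameNode and the DataNodes over the network.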