
Streaming files from a tar file in hdfs


I have tiff images stored in tar files in HDFS. I can download the tar file and stream from it in this way:

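import tarfile
import cv2
import numpy as np

tar = tarfile.open("filename.tar", 'r|')  # 'r|' reads the archive as a forward-only stream
for tiff in tar:
    if tiff.isfile():
        a = tar.extractfile(tiff).read()         # raw bytes of one tiff
        na = np.frombuffer(a, dtype=np.uint8)    # view the bytes as a uint8 array
        im = cv2.imdecode(na, cv2.IMREAD_COLOR)  # decode into an image array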

This gives me a numpy array for each image. I want to know whether there is a way to stream the tiff files directly from the tar files in HDFS.

Here is what I have:

import pyarrow as pa
fs = pa.hdfs.connect()
with fs.open(hdfs_path_to_tar_file, 'rb') as f:
    print(type(f))

<class 'pyarrow.lib.HdfsFile'>

I don't know how to read it with tarfile. I need to turn it into something that tarfile.open can read, but I don't want to read the whole file first. The tar files are huge, so I don't want to load them into memory: f.read() returns bytes, but it pulls the entire file into memory, and tarfile.open couldn't read the result either.


Solution

  • Try passing the HDFS file handle to the fileobj argument of tarfile.open. tarfile then reads from the handle as needed instead of requiring the whole archive as bytes:

    tf = tarfile.open(fileobj=f)