I have tiff images stored in tar files in HDFS. I can download the tar file and stream from it in this way:
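import tarfile
import cv2
import numpy as np

tar = tarfile.open("filename.tar", 'r|')  # 'r|' reads the archive as a stream
for tiff in tar:
    if tiff.isfile():
        a = tar.extractfile(tiff).read()  # raw bytes of one TIFF
        na = np.frombuffer(a, dtype=np.uint8)
        im = cv2.imdecode(na, cv2.IMREAD_COLOR)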
This gives me a NumPy array. I want to know whether there is a way to stream the TIFF files directly from the tar files in HDFS.
Here is what I have:
import pyarrow as pa
fs = pa.hdfs.connect()
with fs.open(hdfs_path_to_tar_file, 'rb') as f:
    print(type(f))
<class 'pyarrow.lib.HdfsFile'>
I don't know how to read it with tarfile. I need to convert it to a bytes-like object that I can read with tarfile.open, but I don't want to read the whole file first. The tar files are huge, so I don't want to load them into memory: f.read() returns bytes but puts the whole file in memory, and tarfile.open couldn't read that either.
Try passing the HDFS file handle to the fileobj argument of tarfile.open:
tf = tarfile.open(fileobj=f)
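To show the idea without an HDFS cluster, here is a minimal sketch that streams members out of a tar through the fileobj argument. The helper names iter_tar_members and make_demo_tar are hypothetical, and an in-memory io.BytesIO stands in for the HDFS handle, so treat it as an illustration under those assumptions rather than a tested HDFS recipe.

```python
import io
import tarfile


def iter_tar_members(fileobj):
    """Yield (name, bytes) for each regular file in a tar read from fileobj.

    Hypothetical helper: fileobj is any binary file-like object with read().
    """
    # mode="r|" treats the archive as a forward-only stream, so the whole
    # tar is never loaded at once; only one member's bytes are read at a time.
    with tarfile.open(fileobj=fileobj, mode="r|") as tf:
        for member in tf:
            if member.isfile():
                # In stream mode the member must be read before moving on.
                yield member.name, tf.extractfile(member).read()


def make_demo_tar():
    """Build a small in-memory tar that stands in for the HDFS file handle."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, payload in [("a.tiff", b"first"), ("b.tiff", b"second")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


members = list(iter_tar_members(make_demo_tar()))
```

With the handle from the question, the call would presumably be iter_tar_members(f) inside the with fs.open(...) block, assuming the pyarrow file object behaves like a regular binary file for sequential reads. Each yielded bytes value can then go through np.frombuffer and cv2.imdecode as in the original loop.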