Tags: python, image-processing, pyspark, hdfs, tiff

Python Image Library fails to read HDFS path


I am trying to read a '.tif' image of shape [m, n, 4] (rows, columns, channels) and data type 'uint16' from HDFS in PySpark with a library such as 'tifffile', using the following code:

import tifffile as tiff
img = tiff.imread('hdfs://master:9000/image1.tif')

However, I always get this error:


FileNotFoundError: [Errno 2] No such file or directory: '/home/user/spark_files/tfos/hdfs:/master:9000/image1.tif'.


The HDFS path of the image, hdfs://master:9000/image1.tif, is correct, and 'tifffile' works well when reading from the local file system instead of HDFS. It looks like the image library does not understand HDFS paths and treats the URI as a relative local path. How can I solve this, given that the Spark API cannot read this kind of image?
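
The mangled path in the error message is consistent with the URI being resolved as an ordinary relative path. The sketch below is only an illustration of that resolution using the standard library, not the actual code inside 'tifffile'; the helper name resolve_as_local_path is hypothetical, and the working directory is taken from the error message above.

```python
import posixpath


def resolve_as_local_path(path, cwd):
    """Resolve `path` against `cwd` the way a local-only library might.

    Joining with the working directory and normalizing collapses the
    double slash in "hdfs://", which yields the path from the error.
    """
    return posixpath.normpath(posixpath.join(cwd, path))


resolved = resolve_as_local_path(
    "hdfs://master:9000/image1.tif",
    "/home/user/spark_files/tfos",  # working directory seen in the error message
)
print(resolved)
```

The printed path matches the one reported in the FileNotFoundError, which suggests the library never contacts HDFS at all.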


Solution

  • I finally solved this problem by reading the file through the hdfs module of pyarrow and decoding the bytes with the imagecodecs library:

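    from pyarrow import hdfs
    import imagecodecs

    # Connect to the HDFS NameNode (host "master", port 9000).
    connect = hdfs.connect("master", 9000)
    # Open the image by its path on HDFS, without the hdfs://host:port prefix.
    with connect.open('/img1.tif', mode='rb') as img_file:
        img_bytes = img_file.read()
    # Decode the raw TIFF bytes into a NumPy array.
    numpy_img = imagecodecs.tiff_decode(img_bytes)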
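
    Newer pyarrow releases deprecate the legacy pyarrow.hdfs module in favor of pyarrow.fs.HadoopFileSystem, so the same idea may need to be written as in the sketch below. This is a minimal sketch under assumptions, not the answer's original code: the helper names read_hdfs_bytes and load_tiff_from_hdfs are hypothetical, and decoding via tifffile.imread on an in-memory stream is swapped in for imagecodecs.tiff_decode. It also assumes the Hadoop native library and classpath are set up for pyarrow.

```python
import io


def read_hdfs_bytes(filesystem, path):
    """Return the full contents of `path` as bytes.

    `filesystem` is assumed to expose open_input_file(), as
    pyarrow.fs.FileSystem objects do.
    """
    with filesystem.open_input_file(path) as handle:
        return handle.read()


def load_tiff_from_hdfs(host, port, path):
    """Read a TIFF from HDFS and decode it into a NumPy array."""
    # Imported here so read_hdfs_bytes stays usable without these packages.
    from pyarrow import fs
    import tifffile

    hdfs = fs.HadoopFileSystem(host, port=port)
    data = read_hdfs_bytes(hdfs, path)
    # tifffile accepts a binary stream, so the bytes never touch local disk.
    return tifffile.imread(io.BytesIO(data))
```

    With the cluster from the question, a call would look like load_tiff_from_hdfs("master", 9000, "/image1.tif"). Inside a Spark job, the function would have to run on the executors, each of which needs access to the same Hadoop configuration.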