python apache-spark tiff

Spark: When reading tif images dataframe only contains rows with empty byte arrays


I'm trying to process multiple folders with 810 separate tif files.

Folder structure:

[screenshot of the folder structure]

When I create a dataframe from these files, the loaded byte arrays are empty, and I need them for processing.

Dataframe creation:

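from sys import argv

from pyspark.sql import SparkSession

name = "tif-reader"  # placeholder; the original snippet does not show how `name` is set

spark = SparkSession \
    .builder \
    .appName(name) \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .getOrCreate()
# Match every file one level below the base folder passed as the first argument.
file_rdd = spark.read.format('image').load(argv[1] + '/' + '*/*')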

argv[1] contains the base folder. When debugging (with a debugger or by printing), I noticed that the dataframe consists of rows that only have the origin set; all the other values are either -1 or empty.

[screenshot of the dataframe contents]

I mainly need the byte array to be filled in, as well as the origin. However, when I watch the memory usage on my system there is an obvious spike, which suggests that Spark is definitely loading something.

Am I doing something wrong or unsupported?


Solution

  • The -1 values mean that the corresponding images are invalid, i.e. Spark could not decode them. If you add the dropInvalid option and set it to True, those rows are dropped from the result.

    Spark uses Java's ImageIO library to read images. ImageIO makes use of plug-ins to support different image formats. Java versions up to 8 only come with plug-ins for JPEG, PNG, BMP, WBMP, and GIF. Java 9 adds a standard plug-in for TIFF. Since Spark officially supports Java 8 only, your option is to use a 3rd-party TIFF plug-in for ImageIO, for example TwelveMonkeys ImageIO, provided by a fellow Stack Overflow user.

    To use the aforementioned plug-in, add something like this to the Spark session configuration:

    .config("spark.jars.packages", "com.twelvemonkeys.imageio:imageio-tiff:3.5,com.twelvemonkeys.imageio:imageio-core:3.5") \
    

    You can track the package versions in the Maven Index.
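
Putting both suggestions together, a minimal sketch might look like the following. The helper names `tiff_packages`, `image_glob`, and `load_images` are hypothetical, the default app name is made up, and version 3.5 is simply the one quoted above. It also assumes the machine running Spark can reach a Maven repository to resolve the packages.

```python
def tiff_packages(version="3.5"):
    """Build the spark.jars.packages value for the TwelveMonkeys TIFF plug-in."""
    group = "com.twelvemonkeys.imageio"
    artifacts = ("imageio-tiff", "imageio-core")
    return ",".join(f"{group}:{artifact}:{version}" for artifact in artifacts)


def image_glob(base_dir):
    """Match every file one folder level below base_dir, as in the question."""
    return base_dir.rstrip("/") + "/*/*"


def load_images(base_dir, app_name="tif-reader"):
    """Read the images with the TIFF plug-in on the classpath (hypothetical helper)."""
    # Imported here so the two helpers above stay usable without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages", tiff_packages())
        .getOrCreate()
    )
    # dropInvalid removes files that ImageIO could not decode from the result.
    return (
        spark.read.format("image")
        .option("dropInvalid", True)
        .load(image_glob(base_dir))
    )
```

Calling `load_images(argv[1])` and then selecting `image.origin`, `image.width`, and `image.height` is a quick way to check whether the TIFF files now decode. Note that `spark.jars.packages` only takes effect if it is set before the Spark session is first created.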