Tags: python, pyspark, python-imaging-library, databricks, lmdb

How to convert lmdb file to PySpark Image DataFrame?


I have an lmdb file whose values contain JPEG image data as binary strings. I want to save all the images to a folder and create a PySpark DataFrame to do my analysis. I am doing this because I want to train a Mask RCNN model on TensorFlow using this data.

I have two questions:

  • Is it a good idea? (I am considering it because this way I will be able to do distributed training and inference.)
  • How do I do it?

One way I could achieve this: save the images one by one to a folder, then read that folder as a PySpark image DataFrame.

import io
from PIL import Image

for key, value in lmdb_data:
    with io.BytesIO(value) as f:
        image = Image.open(f)
        # image is a PIL JpegImageFile; load() reads the full image data
        image.load()
        # lmdb keys are bytes, so decode before using them in a filename
        image.save(f"/tmp/lmdb_images/{key.decode()}.{image.format.lower()}")

df = spark.read.format("image").load("/tmp/lmdb_images/")
df.display()

Is there any other, more efficient/elegant way to do it?


Solution

  • I can only comment on the PIL side of things because I don't use PySpark.

    If your lmdb value is already a JPEG-encoded image, there is no point decoding it into a PIL Image and then re-encoding it back to JPEG to save it to disk. You might as well write the JPEG bytes you already have straight to disk. Untested, but it will look something like:

    for key, value in lmdb_data:
        with open(f"/tmp/...", "wb") as f:
            f.write(value)
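
    Putting that together, a minimal runnable sketch of the write-bytes-as-is approach. The `lmdb_data` dict here is a hypothetical stand-in for an lmdb cursor (real lmdb yields `(key, value)` pairs with `bytes` keys and values), and the output directory is a temp dir rather than the `/tmp/lmdb_images/` path from the question:

    ```python
    import os
    import tempfile

    # Hypothetical stand-in for iterating an lmdb cursor: bytes keys,
    # values that are already JPEG-encoded bytes.
    lmdb_data = {
        b"img_0": b"\xff\xd8\xff\xe0" + b"\x00" * 16 + b"\xff\xd9",
    }

    out_dir = tempfile.mkdtemp()

    for key, value in lmdb_data.items():
        # keys are bytes, so decode before building a filename
        name = key.decode("utf-8")
        with open(os.path.join(out_dir, f"{name}.jpg"), "wb") as f:
            # write the JPEG bytes unchanged: no PIL decode/re-encode
            f.write(value)
    ```

    After this, the folder can be read exactly as in the question, e.g. `spark.read.format("image").load(out_dir)`, since Spark's image source only cares about the files on disk, not how they were written.
    
    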