
Convert an image in a PySpark DataFrame to a NumPy array


I have a DataFrame in PySpark (version 3.1.2) which contains images:

img_path = "s3://multimedia-commons/data/images/000/24a/00024a73d1a4c32fb29732d56a2.jpg"
df = spark.read.format("image").load(img_path)
df.printSchema()
df.select("image.height", "image.width",
          "image.nChannels", "image.mode",
          "image.data").show()

Output:

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)

+------+-----+---------+----+--------------------+
|height|width|nChannels|mode|                data|
+------+-----+---------+----+--------------------+
|   260|  500|        3|  16|[00 00 00 00 00 0...|
+------+-----+---------+----+--------------------+

I need to convert the image into a NumPy array to pass to a machine learning model.

The approach in https://stackoverflow.com/a/69215982/11262633 seems reasonable, but it gives me incorrect pixel values.

import pyspark.sql.functions as F
from pyspark.ml.image import ImageSchema
from pyspark.ml.linalg import DenseVector, VectorUDT
import numpy as np

img2vec = F.udf(lambda x: DenseVector(ImageSchema.toNDArray(x).flatten()), VectorUDT())

print(f'Image fields = {ImageSchema.imageFields}')
df_new = df.withColumn('vecs',img2vec('image'))

row_dict = df_new.first().asDict()
img_vec = row_dict['vecs']

img_dict = row_dict['image']
width = img_dict['width']
height = img_dict['height']
nChannels = img_dict['nChannels']
img_np = img_vec.reshape(height, width, nChannels)

m = np.ma.masked_greater(img_np, 100)
m_mask = m.mask
args = np.argwhere(m_mask)
for idx, (r, c, _) in enumerate(args):
    print(r, c, img_np[r,c])
    if idx > 5:
        break    

Output:

46 136 [  0.  13. 101.]
47 104 [  1.  15. 102.]
47 105 [  1.  16. 104.]
47 106 [  1.  16. 104.]
47 107 [  1.  16. 104.]
47 108 [  1.  16. 104.]
47 109 [  1.  15. 105.]

Here's a visualization of the image:

[Image: rendering of the array obtained from PySpark]

Desired Results

Reading the image using Pillow gives a different result:

from PIL import Image
import numpy as np

img = Image.open('/home/hadoop/00024a73d1a4c32fb29732d56a2.jpg')
img_np = np.asarray(img)
m = np.ma.masked_greater(img_np, 100)
m_mask = m.mask
args = np.argwhere(m_mask)
for idx, (r, c, _) in enumerate(args):
    print(r, c, img_np[r,c])
    if idx > 5:
        break    

Output:

47 104 [101  16   9]
47 105 [103  16   9]
47 106 [103  16   9]
47 107 [103  16   9]
47 108 [103  16   9]
47 109 [104  15   9]
47 110 [105  16  10]

[Image: rendering of the array obtained from Pillow]

My question

Why do the two images differ, both in appearance and in their individual pixel values?

Using np.asarray on the bytes data returned by PySpark gave the same issue. Maybe PySpark is fine and there's just some error in my manipulations of the returned data. I've spent about 8 hours working on this. Thanks in advance for any insights you may have.


Solution

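  • This is because Spark stores the image data in OpenCV-compatible channel order. The image schema documentation describes the field as:

    data: BinaryType (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)

    Pillow, on the other hand, returns the channels in RGB order, so the first and third channels of each pixel are swapped between the two arrays. Reversing the last axis of the PySpark array converts it from BGR to RGB.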
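
    The channel swap can be sketched in plain NumPy. The helper name spark_image_to_rgb is hypothetical (it is not part of PySpark), and the sketch assumes 8-bit, 3-channel data stored row-wise in BGR order, as in the schema above. A one-pixel byte string stands in for the image.data field so that the example runs without a Spark session:

```python
import numpy as np


def spark_image_to_rgb(data, height, width, n_channels):
    """Turn the raw bytes of a Spark image struct into an RGB array.

    Hypothetical helper, not part of PySpark. Assumes 8-bit, 3-channel
    data stored row-wise in BGR order.
    """
    if n_channels != 3:
        raise ValueError("this sketch only handles 3-channel (BGR) images")
    # Interpret the flat byte buffer as a (height, width, channels) array.
    bgr = np.frombuffer(data, dtype=np.uint8).reshape(height, width, n_channels)
    # Reverse the channel axis (BGR -> RGB) and copy into contiguous memory.
    return np.ascontiguousarray(bgr[..., ::-1])


# Stand-in for image.data: a single pixel with the BGR values that the
# question printed for row 47, column 104.
raw = bytes([1, 15, 102])
rgb = spark_image_to_rgb(raw, height=1, width=1, n_channels=3)
print(rgb[0, 0])  # red value first, blue value last
```

    With the DataFrame from the question, the arguments would come from the image struct, for example img = df.first()["image"] followed by spark_image_to_rgb(img.data, img.height, img.width, img.nChannels). The same reversal could also be applied inside the UDF from the question, as in ImageSchema.toNDArray(x)[..., ::-1].flatten(). Note that after the swap the values in the question match Pillow's only approximately ([102, 15, 1] versus [101, 16, 9] at row 47, column 104); the remaining gap is presumably due to the two libraries decoding the JPEG differently.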