Convert an image in a PySpark DataFrame to a NumPy array

I have a DataFrame in PySpark (version 3.1.2) which contains images:

img_path = "s3://multimedia-commons/data/images/000/24a/00024a73d1a4c32fb29732d56a2.jpg"
df = spark.read.format("image").load(img_path)
df.printSchema()
df.select(
    "image.height", "image.width",
    "image.nChannels", "image.mode",
    "image.data",
).show()

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)

+------+-----+---------+----+--------------------+
|height|width|nChannels|mode|                data|
+------+-----+---------+----+--------------------+
|   260|  500|        3|  16|[00 00 00 00 00 0...|
+------+-----+---------+----+--------------------+

I need to convert the image into a NumPy array to pass to a machine learning model.

The approach in https://stackoverflow.com/a/69215982/11262633 seems reasonable, but it gives me incorrect pixel values.

import pyspark.sql.functions as F
from pyspark.ml.image import ImageSchema
from pyspark.ml.linalg import DenseVector, VectorUDT
import numpy as np

img2vec = F.udf(lambda x: DenseVector(ImageSchema.toNDArray(x).flatten()), VectorUDT())

print(f'Image fields = {ImageSchema.imageFields}')
df_new = df.withColumn('vecs',img2vec('image'))

row_dict = df_new.first().asDict()
img_vec = row_dict['vecs']

img_dict = row_dict['image']
width = img_dict['width']
height = img_dict['height']
nChannels = img_dict['nChannels']
img_np = img_vec.reshape(height, width, nChannels)

m = np.ma.masked_greater(img_np, 100)
m_mask = m.mask
args = np.argwhere(m_mask)
for idx, (r, c, _) in enumerate(args):
    print(r, c, img_np[r,c])
    if idx > 5:
        break

Output:

46 136 [  0.  13. 101.]
47 104 [  1.  15. 102.]
47 105 [  1.  16. 104.]
47 106 [  1.  16. 104.]
47 107 [  1.  16. 104.]
47 108 [  1.  16. 104.]
47 109 [  1.  15. 105.]

Here's a visualization of the image:

Desired Results

Reading the image using Pillow gives a different result:

from PIL import Image
import numpy as np

img = Image.open('/home/hadoop/00024a73d1a4c32fb29732d56a2.jpg')
img_np = np.asarray(img)
m = np.ma.masked_greater(img_np, 100)
m_mask = m.mask
args = np.argwhere(m_mask)
for idx, (r, c, _) in enumerate(args):
    print(r, c, img_np[r,c])
    if idx > 5:
        break

Output:

47 104 [101  16   9]
47 105 [103  16   9]
47 106 [103  16   9]
47 107 [103  16   9]
47 108 [103  16   9]
47 109 [104  15   9]
47 110 [105  16  10]

My question

Why are the images different, both in appearance and when I read individual pixels?

Using np.asarray on the bytes data returned by PySpark produced the same problem. Maybe PySpark is fine and there is just an error in how I manipulate the returned data. I've spent about 8 hours working on this. Thanks in advance for any insights you may have.

Solution

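This happens because Spark stores the pixel bytes in OpenCV-compatible channel order. Spark documents the field as:

data: BinaryType (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)

Pillow, by contrast, returns the channels in RGB order. The two arrays therefore describe the same image with the first and last channels swapped. For the pixel at row 47, column 104, Spark reports [1, 15, 102] (blue, green, red) while Pillow reports [101, 16, 9] (red, green, blue). Reversing the last axis of the Spark array, for example with img_np[..., ::-1], puts it in RGB order.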
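The channel swap can be sketched as follows. This is a minimal illustration rather than a definitive implementation: `spark_image_to_rgb` is a hypothetical helper name, the input bytes are synthetic (loosely modelled on the pixel values in the question), and the 4-channel branch assumes BGRA layout.

```python
import numpy as np


def spark_image_to_rgb(height, width, n_channels, data):
    """Hypothetical helper: turn Spark image-struct fields into an RGB(A) array.

    Assumes `data` holds row-wise uint8 pixels in OpenCV order
    (BGR for 3 channels, BGRA for 4 channels).
    """
    arr = np.frombuffer(data, dtype=np.uint8).reshape(height, width, n_channels)
    if n_channels == 3:
        # BGR -> RGB: reverse the channel axis; copy to get a writable array.
        return arr[..., ::-1].copy()
    if n_channels == 4:
        # BGRA -> RGBA: swap blue and red, keep alpha last.
        return arr[..., [2, 1, 0, 3]]
    # Single-channel (grayscale) data needs no reordering.
    return arr.copy()


# Synthetic 1x2 image: two pixels stored as B, G, R bytes.
bgr_bytes = bytes([9, 16, 101, 10, 16, 105])
rgb = spark_image_to_rgb(1, 2, 3, bgr_bytes)
print(rgb[0, 0], rgb[0, 1])
```

With the DataFrame from the question, the arguments would come from the image struct of a row (height, width, nChannels, data). The same reversal could also be applied to the result of ImageSchema.toNDArray before flattening it in the UDF. Small per-pixel differences from Pillow may remain after the swap, possibly because the two libraries decode JPEG differently; that part is an assumption, not something the question confirms.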