I'm working with images stored as NumPy arrays, and I need to serialize them to and deserialize them from JSON (I'm using MongoDB).
I know that NumPy arrays cannot be serialized with json.dump, but I wonder whether there is a better way than the one below: converting a uint8 NumPy array to BSON multiplies the number of bytes by almost 12, and I don't understand why:
import numpy as np
import bson
RC = 500
npdata = np.zeros(shape=(RC,RC,3), dtype='B')
rows, cols, depth = npdata.shape
npsize = rows*cols*depth
npdata=npdata.reshape((npsize,))
listdata = npdata.tolist()
bsondata = bson.BSON.encode({"rows": rows, "cols": cols, "data": listdata})
lb = len(bsondata)
print(lb, npsize, lb/npsize)
> 8888926 750000 11.851901333333334
The reason for this increased number of bytes is how BSON saves the data. You can find this information in the BSON specification, but let's look at a concrete example:
import numpy as np
import bson
npdata = np.arange(5, dtype='B') * 11
listdata = npdata.tolist()
bsondata = bson.BSON.encode({"data": listdata})
print([hex(b) for b in bsondata])
Here, we store an array with the values [0, 11, 22, 33, 44] as BSON and print the resulting binary data. Below I have annotated the result to explain what's going on:
['0x33', '0x0', '0x0', '0x0', # total number of bytes in the document (51)
# First element in document
'0x4', # Array
'0x64', '0x61', '0x74', '0x61', '0x0', # key: "data"
# subdocument (data array)
'0x28', '0x0', '0x0', '0x0', # total number of bytes in the array (40)
# first element in data array
'0x10', # 32 bit integer
'0x30', '0x0', # key: "0"
'0x0', '0x0', '0x0', '0x0', # value: 0
# second element in data array
'0x10', # 32 bit integer
'0x31', '0x0', # key: "1"
'0xb', '0x0', '0x0', '0x0', # value: 11
# third element in data array
'0x10', # 32 bit integer
'0x32', '0x0', # key: "2"
'0x16', '0x0', '0x0', '0x0', # value: 22
# ...
]
In addition to some format overhead, each value of the array is rather wastefully encoded with at least 7 bytes: 1 byte to specify the data type, 2 bytes for a null-terminated string containing the index (3 bytes for indices >= 10, 4 bytes for indices >= 100, ...) and 4 bytes for the 32-bit integer value.
This at least explains why the BSON data is so much bigger than the original array.
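As a sanity check, the per-element cost described above can be added up by hand. The sketch below is plain arithmetic and does not call the bson package; the function names int32_list_size and question_document_size are made up for this illustration, and it assumes every value is stored as a 32-bit integer, as in the dump above.

```python
def int32_list_size(n: int) -> int:
    """Bytes used by a BSON array holding n 32-bit integers."""
    size = 4 + 1  # int32 length prefix + terminating 0x00 of the array document
    for i in range(n):
        # type byte + index as a null-terminated decimal string + int32 value
        size += 1 + (len(str(i)) + 1) + 4
    return size


def question_document_size(n: int) -> int:
    """Bytes used by {"rows": int32, "cols": int32, "data": [n int32 values]}."""
    int32_field = 1 + len("rows") + 1 + 4          # same size for "cols"
    data_field = 1 + len("data") + 1 + int32_list_size(n)
    return 4 + 2 * int32_field + data_field + 1    # length prefix ... terminator


n = 500 * 500 * 3
total = question_document_size(n)
print(total, n, total / n)
```

For the 500x500x3 image this adds up to 8888926 bytes, the same number the question reports, so the index strings and the 4-byte integers account for the entire blow-up.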
I found two libraries, mongodb/bson-numpy and ajdavis/bson-numpy (both on GitHub), which may do a better job of encoding NumPy arrays in BSON. However, I have not tried them, so I can't say whether that is the case or whether they even work correctly.
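Another option, which I have not tested against MongoDB itself, is to skip the list and store the raw pixel bytes in a single BSON binary field. To keep the sketch free of dependencies, it builds the document by hand with struct, following the layout in the BSON specification; the helper names (int32_element, binary_element, document) are made up for this illustration, and bytes(...) stands in for npdata.tobytes(). As far as I know, PyMongo encodes a Python 3 bytes value as BSON binary on its own, so in practice passing npdata.tobytes() in the dict should be enough.

```python
import struct


def int32_element(key: str, value: int) -> bytes:
    # type 0x10, key as a C string, little-endian int32 value
    return b"\x10" + key.encode("utf-8") + b"\x00" + struct.pack("<i", value)


def binary_element(key: str, payload: bytes) -> bytes:
    # type 0x05, key as a C string, int32 payload length,
    # subtype 0x00 (generic binary), then the raw payload
    header = b"\x05" + key.encode("utf-8") + b"\x00"
    return header + struct.pack("<i", len(payload)) + b"\x00" + payload


def document(*elements: bytes) -> bytes:
    # int32 total size (including itself), the elements, terminating 0x00
    body = b"".join(elements) + b"\x00"
    return struct.pack("<i", len(body) + 4) + body


rows, cols, depth = 500, 500, 3
pixels = bytes(rows * cols * depth)  # stand-in for npdata.tobytes()

doc = document(
    int32_element("rows", rows),
    int32_element("cols", cols),
    binary_element("data", pixels),
)
print(len(doc), len(pixels), len(doc) / len(pixels))
```

With this layout the overhead is a fixed 36 bytes for the whole document instead of 7 or more bytes per pixel value. To read the image back, you would also need to store the depth and dtype, or agree on them by convention, so the flat byte string can be reshaped.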