arrays python-3.x numpy serialization bson

Why is a BSON serialized numpy array much bigger than the original?

I'm working with images in numpy array form. I need to serialize/deserialize them to/from JSON (I'm using MongoDB).

numpy arrays cannot be serialized with json.dump. I am aware of this, but I wonder if there is a better way than what I do now: converting a numpy array of bytes to BSON multiplies the number of bytes by almost 12, and I don't understand why:

import numpy as np
import bson

RC = 500
npdata = np.zeros(shape=(RC, RC, 3), dtype='B')  # 500 x 500 image, 3 unsigned bytes per pixel
rows, cols, depth = npdata.shape
npsize = rows * cols * depth
npdata = npdata.reshape((npsize,))  # flatten to one dimension
listdata = npdata.tolist()  # plain list of Python ints
bsondata = bson.BSON.encode({"rows": rows, "cols": cols, "data": listdata})
lb = len(bsondata)
print(lb, npsize, lb / npsize)

> 8888926 750000 11.851901333333334

Solution

The reason for this increased number of bytes is how BSON saves the data. You can find this information in the BSON specification, but let's look at a concrete example:

import numpy as np
import bson

npdata = np.arange(5, dtype='B') * 11  # [0, 11, 22, 33, 44]
listdata = npdata.tolist()
bsondata = bson.BSON.encode({"data": listdata})  # only the array, to keep the dump short

print([hex(b) for b in bsondata])

Here, we store an array with the values [0, 11, 22, 33, 44] as BSON and print the resulting binary data. Below I have annotated the result to explain what's going on:

['0x33', '0x0', '0x0', '0x0',  # total number of bytes in the document (51)
 # First element in document
     '0x4',  # Array
     '0x64', '0x61', '0x74', '0x61', '0x0',  # key: "data"
     # subdocument (data array)
         '0x28',  '0x0', '0x0', '0x0',  # total number of bytes in the array (40)
         # first element in data array
             '0x10',                        # 32 bit integer
             '0x30', '0x0',                 # key: "0"
             '0x0', '0x0', '0x0', '0x0',    # value: 0
         # second element in data array
             '0x10',                        # 32 bit integer
             '0x31', '0x0',                 # key: "1"
             '0xb', '0x0', '0x0', '0x0',    # value: 11
         # third element in data array
             '0x10',                        # 32 bit integer
             '0x32', '0x0',                 # key: "2"
             '0x16', '0x0', '0x0', '0x0',   # value: 22             
 # ...
]
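To make the layout concrete, here is a minimal sketch that builds the same kind of byte sequence by hand with only the standard library. It assumes the five-element array [0, 11, 22, 33, 44] stored alone under the key "data"; the helper name encode_int32_array is mine and is not part of the bson package.

```python
import struct

def encode_int32_array(values):
    """Hand-encode a list of small ints using the BSON array layout shown above."""
    body = b""
    for index, value in enumerate(values):
        body += b"\x10"                                # element type: 32 bit integer
        body += str(index).encode("ascii") + b"\x00"   # key: decimal index as a C string
        body += struct.pack("<i", value)               # value: little-endian int32
    body += b"\x00"                                    # end-of-document marker
    return struct.pack("<i", len(body) + 4) + body     # the length prefix counts itself

array_doc = encode_int32_array([0, 11, 22, 33, 44])

# Wrap the array as the "data" element of the outer document.
element = b"\x04" + b"data\x00" + array_doc
document = struct.pack("<i", 4 + len(element) + 1) + element + b"\x00"

print(len(array_doc), len(document))
print([hex(b) for b in document[:17]])
```

The five payload bytes end up in a 40-byte array, which sits in a 51-byte document.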

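In addition to some format overhead, each value of the array is rather wastefully encoded with at least 7 bytes: 1 byte to specify the data type, 2 bytes for a null-terminated string containing the index (3 bytes for indices >= 10, 4 bytes for indices >= 100, ...), and 4 bytes for the 32 bit integer value. In the 750,000-element array from the question, most indices have six digits, so most elements take 1 + 7 + 4 = 12 bytes, which matches the observed factor of about 11.85.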
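As a sanity check, the per-element accounting can be turned into a small size calculation for the array in the question. This is only a sketch of the arithmetic, assuming every value is stored as a 32 bit integer and the document has the three keys "rows", "cols" and "data"; the function names are made up for illustration.

```python
def bson_int32_array_doc_size(n):
    """Bytes used by a BSON array of n int32 values, including prefix and terminator."""
    key_bytes = sum(len(str(i)) + 1 for i in range(n))  # decimal index plus NUL byte
    return 4 + n * (1 + 4) + key_bytes + 1              # length, type + value per element, keys, terminator

def int32_element_size(key):
    """Bytes used by one int32 element: type byte, key as C string, 4-byte value."""
    return 1 + len(key) + 1 + 4

n = 500 * 500 * 3
data_element = 1 + len("data") + 1 + bson_int32_array_doc_size(n)
total = 4 + int32_element_size("rows") + int32_element_size("cols") + data_element + 1

print(total, n, total / n)
```

This gives 8888926 bytes, the same number the question reports.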

This at least explains why the BSON data is so much bigger than the original array.

I found two libraries on GitHub, mongodb/bson-numpy and ajdavis/bson-numpy, which may do a better job of encoding numpy arrays in BSON. However, I did not try them, so I can't say whether that is the case or whether they even work correctly.
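Another option that needs no extra library is to store the raw bytes of the array in a single BSON binary field instead of a list of integers. The sketch below hand-encodes such a document with the standard library, just to show how small the overhead becomes. It uses a zero-filled bytes object as a stand-in for npdata.tobytes(), and the helper names are hypothetical. With PyMongo's bson package, wrapping the bytes in bson.Binary should produce this kind of element, but I have not measured that here.

```python
import struct

def int32_element(key, value):
    # type 0x10, key as C string, little-endian int32 value
    return b"\x10" + key.encode("ascii") + b"\x00" + struct.pack("<i", value)

def binary_element(key, payload):
    # type 0x05, key as C string, int32 payload length, subtype 0x00 (generic binary), payload
    return (b"\x05" + key.encode("ascii") + b"\x00"
            + struct.pack("<i", len(payload)) + b"\x00" + payload)

rows, cols, depth = 500, 500, 3
pixels = bytes(rows * cols * depth)  # stand-in for npdata.tobytes()

body = (int32_element("rows", rows)
        + int32_element("cols", cols)
        + binary_element("data", pixels))
document = struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

print(len(document), len(pixels), len(document) / len(pixels))
```

Here the whole document is only 36 bytes larger than the pixel data. The shape (including the depth) and the dtype have to be stored alongside the bytes so the array can be rebuilt on the way back, for example with np.frombuffer followed by reshape.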