Why is a BSON serialized numpy array much bigger than the original?


I'm working with images stored as numpy arrays. I need to serialize/deserialize them to/from JSON (I'm using MongoDB).

I know that numpy arrays cannot be serialized with json.dump, but I wonder if there is a better way than going through a list: converting a numpy array of bytes to BSON multiplies the number of bytes by almost 12, and I don't understand why:

import numpy as np
import bson

RC = 500
npdata = np.zeros(shape=(RC, RC, 3), dtype='B')  # RC x RC image, 3 channels, one unsigned byte each
rows, cols, depth = npdata.shape
npsize = rows * cols * depth                     # number of bytes in the raw array
npdata = npdata.reshape((npsize,))               # flatten to one dimension
listdata = npdata.tolist()                       # plain list of Python ints
bsondata = bson.BSON.encode({"rows": rows, "cols": cols, "data": listdata})
lb = len(bsondata)
print(lb, npsize, lb / npsize)

> 8888926 750000 11.851901333333334 

Solution

  • The reason for this increased number of bytes is how BSON saves the data. You can find this information in the BSON specification, but let's look at a concrete example:

    import numpy as np
    import bson
    
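    npdata = np.arange(5, dtype='B') * 11
    listdata = npdata.tolist()
    rows, cols = 1, 5  # placeholder values; any small ints give the same size
    # "data" is listed first so that it is the first element in the byte dump
    bsondata = bson.BSON.encode({"data": listdata, "rows": rows, "cols": cols})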
    
    print([hex(b) for b in bsondata])
    

    Here, we store an array with values [0, 11, 22, 33, 44] as BSON and print the resulting binary data. Below I have annotated the result to explain what's going on:

    ['0x47', '0x0', '0x0', '0x0',  # total number of bytes in the document
     # First element in document
         '0x4',  # Array
         '0x64', '0x61', '0x74', '0x61', '0x0',  # key: "data"
         # subdocument (data array)
             '0x28', '0x0', '0x0', '0x0',  # total number of bytes in the array (40 = 4 + 5*7 + 1)
             # first element in data array
                 '0x10',                        # 32 bit integer
                 '0x30', '0x0',                 # key: "0"
                 '0x0', '0x0', '0x0', '0x0',    # value: 0
             # second element in data array
                 '0x10',                        # 32 bit integer
                 '0x31', '0x0',                 # key: "1"
                 '0xb', '0x0', '0x0', '0x0',    # value: 11
             # third element in data array
                 '0x10',                        # 32 bit integer
                 '0x32', '0x0',                 # key: "2"
                 '0x16', '0x0', '0x0', '0x0',   # value: 22             
     # ...
    ]
    

    In addition to some format overhead, each value of the array is rather wastefully encoded with at least 7 bytes: 1 byte to specify the data type, 2 bytes for a null-terminated string containing the index (3 bytes for indices >= 10, 4 bytes for indices >= 100, ...), and 4 bytes for the 32-bit integer value.

    This explains why the BSON data is so much bigger than the original array: with 750000 elements, most indices have six digits, so most values take 1 + 7 + 4 = 12 bytes instead of 1.
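
As a cross-check of the per-element cost described above, the total size can be worked out without encoding anything. This is only a sketch: the helper names `int32_array_doc_size` and `question_doc_size` are made up for illustration, and it assumes every integer is stored as a 32-bit value, as in the annotated dump.

```python
def int32_array_doc_size(n):
    """Size in bytes of a BSON array holding n 32-bit integers."""
    # Each element: 1 type byte + index as a C string (digits + NUL) + 4 value bytes.
    elements = sum(1 + len(str(i)) + 1 + 4 for i in range(n))
    # An array is stored as a document: int32 length prefix + elements + trailing NUL.
    return 4 + elements + 1


def question_doc_size(n):
    """Size of {"rows": int32, "cols": int32, "data": [n int32 values]}."""
    int32_field = 1 + len("rows") + 1 + 4   # "rows" and "cols" have the same size
    data_field = 1 + len("data") + 1 + int32_array_doc_size(n)
    # Outer document: int32 length prefix + three fields + trailing NUL.
    return 4 + 2 * int32_field + data_field + 1


n = 500 * 500 * 3
size = question_doc_size(n)
print(size, n, size / n)   # size is 8888926, the number reported in the question
```

The result matches the 8888926 bytes measured in the question, so the index strings and type bytes account for the whole blow-up.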

    I found two libraries, mongodb/bson-numpy and ajdavis/bson-numpy (both on GitHub), which may do a better job of encoding numpy arrays in BSON. However, I have not tried them, so I can't say whether that is the case or whether they even work correctly.
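
Independently of those libraries, the BSON specification also has a binary element type that stores a byte string with a fixed overhead instead of a per-value one. The sketch below hand-encodes such a document with the standard library only, to show the layout and the resulting size; `encode_binary_doc` is a hypothetical helper, not part of any library. In practice, PyMongo's bson module should produce this element type when given a Python 3 `bytes` value, so passing `npdata.tobytes()` instead of `npdata.tolist()` (and rebuilding the array with `np.frombuffer` and `reshape`) is one possible approach; I have not measured it here.

```python
import struct


def encode_binary_doc(key, payload):
    """Hand-encode {key: <binary payload>} following the BSON layout."""
    element = (
        b"\x05"                              # element type 0x05: binary data
        + key.encode("utf-8") + b"\x00"      # key as a NUL-terminated C string
        + struct.pack("<i", len(payload))    # little-endian int32 payload length
        + b"\x00"                            # subtype 0x00: generic binary
        + payload
    )
    total = 4 + len(element) + 1             # length prefix + element + trailing NUL
    return struct.pack("<i", total) + element + b"\x00"


pixels = bytes(500 * 500 * 3)                # zero bytes standing in for npdata.tobytes()
doc = encode_binary_doc("data", pixels)
print(len(doc), len(pixels), len(doc) - len(pixels))   # overhead is 16 bytes
```

With this layout the overhead no longer grows with the number of pixels; the shape (rows, cols, depth) still has to be stored in separate fields so the array can be reshaped after decoding.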