mongodb go bson

Why does json.RawMessage enlarge the MongoDB document size?


The following code upserts new documents into MongoDB using go.mongodb.org/mongo-driver:

    data := "this is test string blablablablablablabla"
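    // Note: these struct tags are malformed (the conventional form is
    // `json:"version" bson:"version"`), so encoding/json ignores them and
    // emits the field names "Version" and "Data", as seen in the output below.
    type Doc struct {
        Version int    "json:version, bson:version"
        Data    string "json:data, bson:data"
    }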
    dd := Doc{Version: 21, Data: data}
    dObj, _ := json.Marshal(dd)

    queryFilter := bson.M{"version": 1}
    update1 := bson.M{"$set": bson.M{"version": 1, "data": json.RawMessage(dObj)}}

    // insert data with json.RawMessage
    _, err := db.Mongo("test").Collection("test_doc1").UpdateOne(context.Background(), queryFilter, update1, options.Update().SetUpsert(true))
    if err != nil {
        fmt.Println("failed to insert doc1:", err)
    }

    update2 := bson.M{"$set": bson.M{"version": 1, "data": (dObj)}}

    // insert data without json.RawMessage
    _, err = db.Mongo("test").Collection("test_doc2").UpdateOne(context.Background(), queryFilter, update2, options.Update().SetUpsert(true))
    if err != nil {
        fmt.Println("failed to insert doc2:", err)
    }

In test_doc1 the data field is set to json.RawMessage(dObj), whereas in test_doc2 it is set to (dObj), the plain []byte returned by json.Marshal.

The resulting documents are as follows:

db.test_doc1.find()
{ "_id" : ObjectId("5da164a950d625a5b2e5d23e"), "version" : 1, "data" : [ 123, 34, 86, 101, 114, 115, 105, 111, 110, 34, 58, 50, 49, 44, 34, 68, 97, 116, 97, 34, 58, 34, 116, 104, 105, 115, 32, 105, 115, 32, 116, 101, 115, 116, 32, 115, 116, 114, 105, 110, 103, 32, 98, 108, 97, 98, 108, 97, 98, 108, 97, 98, 108, 97, 98, 108, 97, 98, 108, 97, 98, 108, 97, 34, 125 ] }

db.test_doc2.find()
{ "_id" : ObjectId("5da164a950d625a5b2e5d249"), "version" : 1, "data" : BinData(0,"eyJWZXJzaW9uIjoyMSwiRGF0YSI6InRoaXMgaXMgdGVzdCBzdHJpbmcgYmxhYmxhYmxhYmxhYmxhYmxhYmxhIn0=") }

Checking the size of the two documents gives:

Object.bsonsize(db.test_doc2.findOne())
111

Object.bsonsize(db.test_doc1.findOne())
556

test_doc1 is about five times larger than test_doc2. Why?

Per the BSON documentation:

Array - The document for an array is a normal BSON document with integer values for the keys, starting with 0 and continuing sequentially. For example, the array ['red', 'blue'] would be encoded as the document {'0': 'red', '1': 'blue'}. The keys must be in ascending numerical order.

So a BSON array occupies more space than binary data because every element is stored with its own key? Am I right?

MongoDB version: 4.0


Solution

  • test_doc1 uses json.RawMessage, which is defined as []byte, so it gets stored as an array of integers, one per byte of the marshalled JSON (the raw representation of the document).

    test_doc2 stores the same bytes as binary data, which is a more compact form.

    The Go Mongo Driver uses the WriteBinaryWithSubtype method for the json encoded data but uses WriteArray for the RawMessage.

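    The difference is in the BSON type used on the MongoDB side to store the data. One document stores the byte slice as an array of integers, the other stores it as binary data with a subtype. In an array, every element carries its own type byte and its index as a string key in addition to the integer value, whereas binary data costs one byte per byte plus a small fixed header, so the binary form takes far less space.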

    Digging deeper, I noticed the Go driver uses a registry to determine how it should encode a value to BSON. There's a method dedicated to byte slices.

    // ByteSliceEncodeValue is the ValueEncoderFunc for []byte.
    func (dve DefaultValueEncoders) ByteSliceEncodeValue(ec EncodeContext, vw bsonrw.ValueWriter, val reflect.Value) error {
    

    This method uses the WriteBinary() method to encode byte slices as binary data.

    A custom type, on the other hand (even one whose underlying type is []byte), is treated as a generic slice type and triggers the default encoder for slices.

    // SliceEncodeValue is the ValueEncoderFunc for slice types.
    func (dve DefaultValueEncoders) SliceEncodeValue(ec EncodeContext, vw bsonrw.ValueWriter, val reflect.Value) error {
    

    This method uses the WriteArray() method in turn.
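
    The distinction is visible with the standard reflect package alone. The sketch below does not call the driver; encodingFor is a hypothetical stand-in for the registry lookup described above, assuming an exact-type match is tried first and the value's kind is used as the fallback. Other driver versions may resolve encoders differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// byteSliceType is the exact type []byte. A named type such as
// json.RawMessage has a different reflect.Type even though its
// underlying type is []byte.
var byteSliceType = reflect.TypeOf([]byte(nil))

// sample stands in for the marshalled document from the question.
var sample = json.RawMessage(`{"Version":21}`)

// encodingFor is a hypothetical model of the lookup: an exact type match
// selects the binary encoder, any other slice falls back to the array encoder.
func encodingFor(v interface{}) string {
	t := reflect.TypeOf(v)
	switch {
	case t == nil:
		return "other"
	case t == byteSliceType:
		return "binary"
	case t.Kind() == reflect.Slice:
		return "array"
	default:
		return "other"
	}
}

func main() {
	fmt.Println(encodingFor(sample))         // array
	fmt.Println(encodingFor([]byte(sample))) // binary
}
```

    If the compact form is wanted, converting the value with []byte(raw) before placing it in the update document gives it the plain []byte type, which is what test_doc2 in the question effectively does.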

    Summary: json.Marshal returns a plain []byte, which is encoded as the BSON binary type and stored in the compact binary form. json.RawMessage, even though its underlying type is []byte, is treated as a generic slice (a slice of integers) and is therefore stored in MongoDB as an array of integers.