mongodb, data-modeling, denormalization, nosql

Fragmentation in MongoDB when growing documents


Seems like a blog with comments is the standard example used for describing different modeling strategies when using MongoDB.

My question relates to the model where comments are modeled as a sub collection on a single blog post document (i.e. one document stores everything related to a single blog post).

In the case of multiple simultaneous writes, it seems like you would avoid overwriting previous updates if you use upserts and targeted update modifiers (like $push). Meaning, saving the document for every comment added would not overwrite previously added comments. However, how does fragmentation come into play here? Is it realistic to assume that adding multiple comments over time will result in fragmented memory and potentially slower queries? Are there any guidelines for growing a document through sub collections?

I am also aware of the 16MB limit per document, but that to me seems like a theoretical limit since 16 MB would be an enormous amount of text. In the event of fragmentation, would the documents be compacted the next time the mongo instance is restarted and reads the database back into memory?

I know the way you expect to interact with the data is the best guiding principle for how to model the data (needing comments without the blog post parent etc). However I am interested in learning about potential issues with the highly denormalized single document approach. Are the issues I'm describing even realistic in the given blog post example?


Solution

  • Before answering your questions, I'll try to explain roughly how MongoDB stores data (this describes the MMAPv1 storage engine).

    • For a database named test, you will see data files such as test.0, test.1, ..., so DATABASE = [FILE, ...]
    • FILE = [EXTENT, ...]
    • EXTENT = [RECORD, ...]
    • RECORD = HEADER + DOCUMENT + PADDING
    • HEADER = SIZE + OFFSET + PREV_RECORD_POINTER + NEXT_RECORD_POINTER + FLAG + ...

    This link for your reference
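To make the RECORD = HEADER + DOCUMENT + PADDING line concrete, here is a minimal sketch in plain JavaScript. It is a toy model, not MongoDB's actual code: the names makeRecord and fitsInPlace are hypothetical, and the header size and the byte counts are assumed values picked only for illustration.

```javascript
// Toy model of RECORD = HEADER + DOCUMENT + PADDING.
// HEADER_BYTES is an assumed value for illustration only.
const HEADER_BYTES = 16;

// Build a record of `allocatedBytes` that holds a document of `documentBytes`.
// Whatever is left over after the header and the document is padding.
function makeRecord(documentBytes, allocatedBytes) {
  const paddingBytes = allocatedBytes - HEADER_BYTES - documentBytes;
  if (paddingBytes < 0) {
    throw new RangeError("document does not fit in the allocated record");
  }
  return { headerBytes: HEADER_BYTES, documentBytes, paddingBytes };
}

// A grown document can stay in place only while the padding absorbs the growth.
function fitsInPlace(record, newDocumentBytes) {
  return newDocumentBytes <= record.documentBytes + record.paddingBytes;
}

const record = makeRecord(18, 64);            // hypothetical sizes
const smallGrowth = fitsInPlace(record, 40);  // still fits in the same record
const largeGrowth = fitsInPlace(record, 61);  // too big, would have to move
console.log(record.paddingBytes, smallGrowth, largeGrowth);
```

The point of the padding is that small growth (a few extra comments, say) can be absorbed without relocating the record; only growth beyond the padding forces a move.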

    Now I'll try to answer your questions as best I can.

    1. How does fragmentation come into play?
      It happens when the current record is too small to hold the updated document. The document is then moved: the updated version is written to a new space that is large enough, and the original record is deleted. The deleted record becomes a fragment.

    2. Will it result in fragmented memory and potentially slower queries?
      Fragmented memory will occur, but it won't cause slower queries unless there eventually isn't enough memory left to allocate.

    However, a deleted record can be reused if a newly inserted document fits into it. Below is a simple demonstration.
    (Pay attention to the offset field.)

    > db.a.insert([{_id:1},{_id:2},{_id:3}]);
    BulkWriteResult({
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 3,
            "nUpserted" : 0,
            "nMatched" : 0,
            "nModified" : 0,
            "nRemoved" : 0,
            "upserted" : [ ]
    })
    > db.a.find()
    { "_id" : 1 }
    { "_id" : 2 }
    { "_id" : 3 }
    > db.a.find().showDiskLoc()
    { "_id" : 1, "$diskLoc" : { "file" : 0, "offset" : 106672 } }
    { "_id" : 2, "$diskLoc" : { "file" : 0, "offset" : 106736 } }   // the update below moves this document and frees this record
    { "_id" : 3, "$diskLoc" : { "file" : 0, "offset" : 106800 } }
    > db.a.update({_id:2},{$set:{arr:[1,2,3]}});
    WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
    > db.a.find().showDiskLoc()
    { "_id" : 1, "$diskLoc" : { "file" : 0, "offset" : 106672 } }
    { "_id" : 3, "$diskLoc" : { "file" : 0, "offset" : 106800 } }
    { "_id" : 2, "arr" : [ 1, 2, 3 ], "$diskLoc" : { "file" : 0, "offset" : 106864 } }  // migration happened
    > db.a.insert({_id:4});
    WriteResult({ "nInserted" : 1 })
    > db.a.find().showDiskLoc()
    { "_id" : 1, "$diskLoc" : { "file" : 0, "offset" : 106672 } }
    { "_id" : 3, "$diskLoc" : { "file" : 0, "offset" : 106800 } }
    { "_id" : 2, "arr" : [ 1, 2, 3 ], "$diskLoc" : { "file" : 0, "offset" : 106864 } }
    { "_id" : 4, "$diskLoc" : { "file" : 0, "offset" : 106736 } }   // this space was taken up by {_id:2}, reused now.
    >
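
The behaviour in the transcript above (a grown document moves, and its old record is reused by a later insert) can be imitated with a small free-list model. This is only a sketch in plain JavaScript, assuming a first-fit free list; ToyRecordStore and its methods are hypothetical names, not MongoDB's real allocator. The record size of 64 matches the spacing of the offsets in the transcript, while 128 for the grown document is an assumption.

```javascript
// Toy record store with a free list. Not MongoDB's real allocator.
class ToyRecordStore {
  constructor(startOffset = 0) {
    this.nextOffset = startOffset; // end of the space handed out so far
    this.records = new Map();      // id -> { offset, capacity, size }
    this.freeList = [];            // deleted records: { offset, capacity }
  }

  // First fit from the free list, otherwise append at the end.
  allocate(size) {
    const i = this.freeList.findIndex((slot) => slot.capacity >= size);
    if (i !== -1) {
      return this.freeList.splice(i, 1)[0];
    }
    const slot = { offset: this.nextOffset, capacity: size };
    this.nextOffset += size;
    return slot;
  }

  insert(id, size) {
    const slot = this.allocate(size);
    this.records.set(id, { offset: slot.offset, capacity: slot.capacity, size });
    return slot.offset;
  }

  // Update in place if the record is big enough, otherwise move the document
  // and put the old record on the free list (it becomes a fragment).
  update(id, newSize) {
    const rec = this.records.get(id);
    if (rec === undefined) {
      throw new Error(`no record with id ${id}`);
    }
    if (newSize <= rec.capacity) {
      rec.size = newSize;
      return rec.offset;
    }
    const slot = this.allocate(newSize);
    this.freeList.push({ offset: rec.offset, capacity: rec.capacity });
    this.records.set(id, { offset: slot.offset, capacity: slot.capacity, size: newSize });
    return slot.offset;
  }
}

const store = new ToyRecordStore(106672);
const first = store.insert(1, 64);
const second = store.insert(2, 64);
const third = store.insert(3, 64);
const moved = store.update(2, 128); // too big for its record, so it moves
const reused = store.insert(4, 64); // lands in the record freed by the move
console.log({ first, second, third, moved, reused });
```

In this model the fragment only stays wasted while nothing small enough comes along to fill it, which is the same thing the transcript shows for {_id:4}.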