This is a performance question about MongoDB.
I'm reading the book Learn MongoDB The Hard Way.
The context is how to model / design the schema of a BlogPost with Comments, and the solution discussed is embedding, like so:
{ postName: "..."
, comments:
[ { _id: ObjectId("d63a5725f79f7229ce384e1"), author: "", text: "" } // each comment
, { _id: ObjectId("d63a5725f79f7229ce384e2"), author: "", text: "" } // is a
, { _id: ObjectId("d63a5725f79f7229ce384e3"), author: "", text: "" } // subDocument
]
}
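For context, appending a comment to this structure is an update with the `$push` operator on the `comments` array. Below is a minimal sketch that only builds the filter and update documents as plain objects. The helper name `buildAddCommentUpdate` and the sample values are hypothetical; with a driver or the shell, the two objects would be passed to something like `updateOne(filter, update)`.

```javascript
// Hypothetical helper: builds the arguments for an update that appends one
// comment to the embedded `comments` array of a single blog post.
function buildAddCommentUpdate(postId, comment) {
  return {
    filter: { _id: postId },                   // which BlogPost to modify
    update: { $push: { comments: comment } },  // append to the embedded array
  };
}

// Sample values are placeholders, not real data.
const { filter, update } = buildAddCommentUpdate("post-1", {
  author: "alice",
  text: "Nice post!",
});
console.log(JSON.stringify(filter));
console.log(JSON.stringify(update));
```

Every such update makes the BlogPost document a little larger, which is the growth the book is talking about.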
In the book his data looks different, but in practice it looks like what I have above, since pushing into a list of subDocuments creates _ids.
Pointing out the cons of this embedding approach, in the second counterargument he says this:
The second aspect relates to write performance. As comments get added to Blog Post over time, it becomes hard for MongoDB to predict the correct document padding to apply when a new document is created. MongoDB would need to allocate new space for the growing document. In addition, it would have to copy the document to the new memory location and update all indexes. This could cause a lot more IO load and could impact overall write performance.
In particular, this part:
In addition, it would have to copy the document to the new memory location
Question1: What does this mean actually?
Which document is he referring to: the BlogPost document or the Comment document?
If he is referring to the BlogPost document (it seems so), does that mean the entire document (up to 16MB of data) gets rewritten / copied to a new location on the hard disk every time I insert a subDocument?
Is this how MongoDB actually works under the hood? Can somebody confirm or refute this? It seems like a very big deal to move/copy the entire document for every write, especially as it grows toward its upper limit of 16MB.
Question2:
Also, what happens when I'm updating a simple field, say from status: true to status: false? Will the entire document be moved/copied on the HDD? I would say no: the rest of the document should be left where it is, and the update should happen in place (same memory location), but I'm not sure anymore.
Is there a difference between updating a simple field and adding or removing a subDocument from an array field?
I mean, is this array operation special in some sense, so that it triggers the document copy on the HDD while simple fields and nested objects don't?
What about removing an entire big nested object by setting the field that holds it to null? Will that trigger an HDD copy? Or will it not, since that space is pre-allocated because of how the schema is defined?
I'm quite confused. My project will need 500 writes/second, and I'm trying to work out whether these implementation aspects can affect me. Thanks :)
A lot of the details of this behavior are specific to the MMAPv1 storage engine, which was deprecated in MongoDB 4.0. Additionally, the default storage engine since MongoDB 3.2 has been WiredTiger, which has a different system for managing data on disk.
That being said:
Question1: What does this mean actually?
MMAPv1 would write documents into storage files with a pre-allocated "padding factor" that provided empty space for adding data in the future. If a document was updated in such a way that the padding was no longer sufficient, the document would need to be moved to a new space on disk.
In practice, updates that would not grow the size of a document (e.g. incrementing a value, changing status: true to status: false, etc.) would likely be performed in place, whereas operations that grew the document had the potential to move it, since the padding might not be large enough to hold the larger document.
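To make the in-place versus move distinction concrete, here is a toy model in plain JavaScript. It is only a sketch under simplifying assumptions and is not MongoDB's actual allocator: the function names, the byte sizes, and the fixed padding factor of 1.5 are all made up for illustration.

```javascript
// Toy model only, NOT MongoDB's real allocator. A record gets a slot of
// docSize * paddingFactor bytes when it is first written.
function allocateRecord(docSize, paddingFactor) {
  return { docSize, allocated: Math.ceil(docSize * paddingFactor), moves: 0 };
}

// Apply an update that changes the document size by `delta` bytes.
// Returns "in-place" if the new size still fits in the slot, else "moved".
function applyUpdate(record, delta, paddingFactor) {
  const newSize = record.docSize + delta;
  record.docSize = newSize;
  if (newSize <= record.allocated) {
    return "in-place";
  }
  // The document outgrew its slot: model a copy into a fresh, larger slot.
  record.allocated = Math.ceil(newSize * paddingFactor);
  record.moves += 1;
  return "moved";
}

const PADDING = 1.5; // hypothetical padding factor
const post = allocateRecord(1000, PADDING);    // slot of 1500 bytes
console.log(applyUpdate(post, 0, PADDING));    // same size, e.g. flipping a boolean
console.log(applyUpdate(post, 200, PADDING));  // grows to 1200, still fits
console.log(applyUpdate(post, 400, PADDING));  // grows to 1600, exceeds 1500
```

In this model the first two updates stay in place and only the third one forces a move, which mirrors the case of a comments array that keeps growing until it exhausts the padding.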
A good explanation of how MMAPv1 managed documents in data files is described here. More information can be found here.