MongoDB: Nested values vs separate collections for search performance - DB schema design

Suppose I have a MongoDB where I have separate texts which consist of statements.

I need to be able to search for the texts, which have certain keywords in statements (also multiple texts that have an occurrence of a search term).

I also need to be able to find all the statements in all the texts added by a particular user, which contain a particular search phrase.

My question: do I need to create a separate collection for statements or can I simply add them as nested into the texts collection?

So, option 1 (separate collections):

Texts collection


text: {
    name: 'nabokov',
    id: '1'
}

Statements collection:

statement: {
    text_id: '1',
    id: '24',
    text: 'He opened the window and saw the sky`
}

Option 2 (nested):


text: {
    name: 'nabokov',
    id: '1'
    statements: [
        id: '24',
        text: 'He opened the window and saw the sky`
    ]
}

Which MongoDB storage schema is better if I want to retrieve statements separately based on keyword search and retain the contextual data (e.g. which text they belong to etc.)

How would this affect the write / read speed for larger DBs (e.g. > 100 Gb).

My texts would be limited to 16 Mb.

Solution

For MongoDB document schema design w.r.t. performance, there are several factors that could be helpful to take into consideration:

What are the cardinalities of the relationships between collections?
What is the expected number/size of documents in a collection?
What are the most frequently used queries?
how often are documents getting updated?

For your scenario, we actually need more context / details from you to work out a more sensible "answer". But here are some common scenarios that I have personally come into before and it might be useful for you as a reference.

text as a root document that is not frequently updated; Most of the queries are based on the statement collection as a child collection.

In this case, it could be a good idea to denormalize the text document and replicating the field name into corresponding statement document. e.g.

statement: {
    text_id: '1',
    text_name: 'nabokov',
    id: '24',
    text: 'He opened the window and saw the sky`
}

In this way, you gain performance boost by avoiding a $lookup to the text collection, while only incurring a small cost of maintaining the new text_name column. The cost is small since the text document is not going to be updated frequently anyway.

a text document will be associated with small number of statements objects/documents only.

In this case, it could be a good idea to go for your option 1 (i.e. keep the statements in an array of text document). The advantage is you can compose rather simple queries and avoid the cost in maintaining another collection of statement.

Here is a very good document to read more about MongoDB schema design.