
MongoDB: storage & when to use relationships


I'm new to MongoDB, so please bear with me.

I have 2 questions:

First, take the following:

// add a record
$obj = array( "title" => "Calvin and Hobbes", "author" => "Bill Watterson" );

Does MongoDB store "title" and "author" as text for every single entry of this object in this collection? Or does it create a schema and convert these to field numbers (or nothing at all and store purely the data)?

My second question is: when should "relations" be used? Let's say I have 100 resellers, who contain (object-wise) 1,000 clients each, and each client has 10 projects. That makes for one huge overall object to manipulate.

In the SQL world, this would all be related "objects". In the Document world, we try to store complete objects by embedding sub-objects.

However, this can be unwieldy. What is the best practice for this? Can someone point me to a guideline, please?

Thanks.


Solution

  • Does MongoDB store field names for every entry in this collection?

    Yes, MongoDB stores the field names as text in every document. In practice this is not usually much of a problem, but if disk space is a limiting factor, you may want to consider something else, such as shorter field names.
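To make the per-document cost concrete, here is a minimal sketch in Python that encodes a flat, string-only document using the BSON layout for string elements (type byte, NUL-terminated field name, length-prefixed value). The function name `encode_string_document` is made up for this illustration, it handles only string values, and it ignores the `_id` field a real insert would add, so treat the byte counts as an approximation of what a driver would produce.

```python
import struct


def encode_string_document(doc):
    """Encode a flat dict of str -> str using the BSON layout for string elements."""
    body = b""
    for name, value in doc.items():
        name_bytes = name.encode("utf-8") + b"\x00"    # field name, stored as text
        value_bytes = value.encode("utf-8") + b"\x00"  # value with trailing NUL
        # 0x02 marks a string element; the int32 is the value length incl. NUL.
        body += b"\x02" + name_bytes + struct.pack("<i", len(value_bytes)) + value_bytes
    # 4-byte total length prefix + elements + terminating NUL byte
    total = 4 + len(body) + 1
    return struct.pack("<i", total) + body + b"\x00"


long_keys = encode_string_document(
    {"title": "Calvin and Hobbes", "author": "Bill Watterson"})
short_keys = encode_string_document(
    {"t": "Calvin and Hobbes", "a": "Bill Watterson"})

# The field names appear verbatim inside every encoded document.
print(b"title" in long_keys, b"author" in long_keys)
# Shorter names save exactly the characters removed from each name.
print(len(long_keys) - len(short_keys))
```

The saving is per document, so it only matters when documents are small and very numerous. Storage engines that compress data on disk can reduce the cost further.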

    When should "relations" be used?

    This is more an art than a science. The MongoDB documentation on schemas is a good reference, but here are some things to consider:

    • Put as much in as possible

      The joy of a document database is that it eliminates lots of joins. Your first instinct should be to place as much in a single document as you can. Because MongoDB documents have structure, and because you can efficiently query within that structure, there is no immediate need to normalize data like you would in SQL. In particular, any data that is not useful apart from its parent document should be part of the same document.

    • Separate data that can be referred to from multiple places into its own collection.

      This is not so much a "storage space" issue as it is a "data consistency" issue. If many records will refer to the same data, it is more efficient and less error-prone to update a single record and keep references to it in other places.

    • Document size considerations

      MongoDB imposes a size limit on a single document: 4MB at the time of writing, raised to 16MB in later versions. In a world of GB of data this sounds small, but 4MB is also roughly 30 thousand tweets, or 250 typical Stack Overflow answers, or 20 Flickr photos. On the other hand, this is far more information than one might want to present at one time on a typical web page. First consider what will make your queries easier. In many cases concern about document sizes will be premature optimization.

      In the example you gave, I would make 3 separate collections (resellers, clients and projects), because I do not need to know about the 9 other projects to create a listing for a project. That keeps the queries simple. (But see the Pro Tip at the bottom.)

    • Complex data structures:

      MongoDB can store arbitrarily deeply nested data structures, but cannot search them efficiently. If your data forms a tree, forest or graph, you effectively need to store each node and its edges in a separate document. (Note that there are data stores specifically designed for this type of data that one should consider as well.)

    • Data Consistency

      MongoDB makes a trade-off between efficiency and consistency. The rule is that changes to a single document are always atomic, while updates to multiple documents should never be assumed to be atomic (newer versions add multi-document transactions, but you have to ask for them explicitly). There is also no way to "lock" a record on the server (you can build this into the client's logic using, for example, a "lock" field). When you design your schema, consider how you will keep your data consistent. Generally, the more that you keep in a document the better.
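As a rough illustration of how these guidelines might apply to the reseller/client/project example, the sketch below uses plain Python lists of dicts as stand-ins for three collections. Every name and value here (`resellers`, `reseller_id`, `milestones`, the `find` helper, and so on) is hypothetical, and `find` only mimics an equality query; it is not a driver API.

```python
# Hypothetical in-memory stand-ins for three MongoDB collections.
# Each "collection" is a list of dicts; all names and values are illustrative.
resellers = [{"_id": "r1", "name": "Acme Resale"}]

clients = [
    {"_id": "c1", "reseller_id": "r1", "name": "Globex"},
    {"_id": "c2", "reseller_id": "r1", "name": "Initech"},
]

projects = [
    # Milestones are not useful apart from their project, so they are embedded.
    {"_id": "p1", "client_id": "c1", "title": "Website",
     "milestones": [{"name": "Design", "done": True},
                    {"name": "Launch", "done": False}]},
    {"_id": "p2", "client_id": "c1", "title": "Newsletter", "milestones": []},
    {"_id": "p3", "client_id": "c2", "title": "Intranet", "milestones": []},
]


def find(collection, **criteria):
    """Return every document whose fields equal all the given criteria."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]


# Listing one project touches only the projects collection.
website = find(projects, _id="p1")[0]
open_milestones = [m["name"] for m in website["milestones"] if not m["done"]]

# Following a reference is one extra, simple query per level.
globex_projects = [p["title"] for p in find(projects, client_id="c1")]

# Renaming a client changes a single document, so it cannot be half-applied.
find(clients, _id="c2")[0]["name"] = "Initech Ltd"

print(open_milestones)
print(globex_projects)
```

The split follows the two rules of thumb: data that only makes sense inside its parent (the milestones) is embedded, while data that is listed and updated on its own (clients, projects) gets its own collection and a reference field.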

    Pro Tip

    Even when you do use references, it is often a good idea to keep a little of the data from the reference in the parent document. Generally, I keep enough information to build a meaningful link to the descendant in the parent.

    In your example this would mean keeping client names along with the ObjectID in the reseller's document, so I could create a link to each client by name without a separate query. If building the URL for the client requires something besides their document id, I would store that as well.

    Tricks like this can cut down on the 1+n query situations.
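The difference can be sketched with in-memory dicts standing in for collections and a counter standing in for round trips to the server. Everything here is an assumption made for illustration: the `find_one` helper, the `/clients/<id>` URL scheme, and the document shapes are not from any real schema or driver.

```python
# Hypothetical in-memory "collections" keyed by _id; all names are illustrative.
clients = {
    "c1": {"_id": "c1", "name": "Globex", "contact": "hank@example.com"},
    "c2": {"_id": "c2", "name": "Initech", "contact": "bill@example.com"},
}

resellers = {
    "r1": {
        "_id": "r1",
        "name": "Acme Resale",
        # Each reference carries a copy of the client name next to the id.
        "clients": [{"_id": "c1", "name": "Globex"},
                    {"_id": "c2", "name": "Initech"}],
    }
}

query_count = 0


def find_one(collection, doc_id):
    """Look up one document by id and count it as one query."""
    global query_count
    query_count += 1
    return collection[doc_id]


def client_links_with_copies(reseller_id):
    """One query: names come from the copies stored in the reseller document."""
    reseller = find_one(resellers, reseller_id)
    return [(ref["name"], "/clients/" + ref["_id"]) for ref in reseller["clients"]]


def client_links_ids_only(reseller_id):
    """1+n queries: each name has to be fetched from the clients collection."""
    reseller = find_one(resellers, reseller_id)
    links = []
    for ref in reseller["clients"]:
        client = find_one(clients, ref["_id"])
        links.append((client["name"], "/clients/" + client["_id"]))
    return links


links = client_links_with_copies("r1")
queries_with_copies = query_count

query_count = 0
same_links = client_links_ids_only("r1")
queries_ids_only = query_count

print(links)
print(queries_with_copies, queries_ids_only)
```

The price of the copy is that renaming a client now means updating two documents, and that pair of updates is not atomic, so this works best for fields that rarely change.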