Search code examples
javascriptmongodboptimizationquery-optimization

Better to use a document with a lot of objects or just a lot of documents?


So I have a little bit of data on a lot of people in a company such as their names, age, and gender. I'm going to store their info in MongoDB. Would it be better for me to store their info in a lot of documents or in one document as a bunch of separate objects? Are there any performance or memory issues that would make one method superior to the other?

Example methods of storing the data:

Lots of documents

{
  _id: ObjectId('1'),
  name: 'Bart',
  age: 10,
  gender: 'Male'
},
{
  _id: ObjectId('2'),
  name: 'Lisa',
  age: 8,
  gender: 'Female'
}

Lots of objects in one document

{
  _id: ObjectId('1'),
  'Bart': {
    age: 10,
    gender: 'Male'
  },
  'Lisa': {
    age: 8,
    gender: 'Female'
  }
}

For anyone wondering I'd query the second example with Mongo's projection parameter, e.g.

db.families.find({_id:ObjectId('1')},{_id:0,'Bart':1});

Also, the only reason I'm asking this is because I intend to store people from multiple companies here. They will be separated either by collections and the people listed individually as documents like the first example or in documents and the people listed individually as objects in the company document.


Solution

  • The first one is preferrable.

    There is a limitation of 16 MB on each document. So putting everything in a single document is more likely to hit that barrier and you have to manually do the document split and you end up with multiple documents for the same (pseudo)collection. You need additional program code just to find the right fragment or even combine the documents in your application to perform some collection-level operations. Unless there is very good reason to do that, I'd avoid that at any cost.

    Further it presumably matches your access pattern best. You also have more options on optimization, for instance you can define an index on the name, which you can't do for the second example. Also updating that document is faster the smaller the document is (especially when no in-place update can occur).

    If you intend to have multiple companies with users you can either use a separate collection for each or add a company attribute on the document. It depends on how many companies you're going to support, but assuming it won't be just 2 or 3, I'd prefer the latter option. It's easier to maintain, scale (i.e. sharding), optimize (indexes etc) or extend.

    {
      _id: ObjectId('1'),
      name: 'Bart',
      age: 10,
      gender: 'Male'
      company: 'XYZ'
    }
    

    Edit:

    Some more considerations regarding performance. The basic flow of events for both options is the following:

    1-doc strategy (with projection)

    1. find the document by objectId, using index (in memory) fast
    2. loading the entire document (from dics) depending on the size of the document, can be slow
    3. projection (in memory) fast

    n-doc strategy (without projection)

    1. find document by objectId or name, using index (in memory), fast
    2. loading (small) document from disc, slow, but faster than loading big documents

    Especially for the 1-doc strategy, there might be a tipping point, when it gets slower than the n-doc strategy, especially when the document grows large. For smaller docs it might be equal or maybe faster, especially when caching comes into play or other edges cases occur (i.e. the range of names is limited which makes queries for names not very selective, but in that case you'd be screwed with the 1-doc approach anyway)

    Mongo's recommendation for schema design are as follows:

    • 1:1 relation: use embedded documents
    • 1:few relation: use embedded documents
    • 1:many use multiple collections

    What you intend to do is to have company:person relationship, which is likely to be third or the second option. So either you have two collections:

    • companies
    • persons (foreign key to company)

    or

    • company (with persons embedded)

    either way, I would model the person as

    person:
    {
      _id: ObjectId('1'),
      name: 'Bart',
      age: 10,
      gender: 'Male'
      company: 'XYZ' //only for foreign key relationship to separate collection
    }
    

    In case of embedded person, it'd be an array in company

    company:
    {
      name: 'companyA',
      persons: [..] //and not use person's name as key here
    }
    

    I can add an index on persons.name and/or company. So searching for a single person runs entirely in memory (using the index) and loading the person document should be fast, as only a small document is read from disc.

    So either of these approaches gives me the highest flexibility while still being quite fast in access.

    Although there might be cases, when projection is fast (probably when having small "company" documents and they are already cached), I wouldn't go that way because it has some serious disadvantages (with some of them having a negative performance impact as well).

    • you cant have indexes on people
    • you require additional application logic if the documents grow beyond 16MB (which might eventually happen)
    • you can not deal with same names (which might happen)
    • you are less flexible (changing schemas, choosing atomicity of update operation in distributed environment, adding aditional access patterns, like listing all persons of a company)
    • maintenance might get cumbersome (you have to introspect company documents to find the names of people)
    • there might be side effects for sharding or replication which I didn't think of now
    • it violates oo design principles (question yourself: is 'Bart' a property of a family or either 'son' or more generally 'children'?) - making it less maintainable, too

    So even without proving that one appproach is faster than the other, I wouldn't go for the projection approach for filtering users because the disadvantages outweigh the (presumed) advantages by far.