mongodb schema-design

MongoDB data structure with a large number of internal documents


I am relatively new to MongoDB, and so far I am really impressed. I am struggling with the best way to set up my document stores, though. I am trying to do some summary analytics using Twitter data, and I am not sure whether to put the tweets into the user document or to keep them as a separate collection. It seems like putting the tweets inside the user model would quickly hit the document size limit. If that is the case, what is a good way to run MapReduce across a group of users' tweets?

I hope I am not being too vague, but I don't want to get too specific and go too far down the wrong path in setting up my domain model.

As I am sure you are all bored of hearing, I am used to RDB land, where I would lay out my schema like this:

| USER |
--------
|ID
|Name
|Etc.

| TWEET |
---------
|ID
|UserID
|Etc.

It seems like the logical schema in Mongo would be

User
|-Tweet (0..3000)
  |-Entities
    |-Hashtags (0..10+)
    |-urls (0..5)
    |-user_mentions (0..12)
  |-GeoData (0..20)
|-somegroupID

But wouldn't that quickly bloat the User document beyond capacity? I would also like to run analysis on tweets belonging to users with a similar somegroupID. Conceptually it makes sense to lay out the model as above, but at what point does that become too unwieldy? And what are viable alternatives?
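
For a rough sense of scale, the sketch below builds a user document in the embedded layout above and compares its size with MongoDB's documented 16 MB per-document limit. The tweet shape, the field sizes, and the helper names (`make_tweet`, `embedded_user`, `approx_size`) are assumptions made up for illustration, and the length of the compact JSON encoding is only a proxy for the real BSON size.

```python
import json

# MongoDB's documented maximum BSON document size (16 MiB).
MAX_BSON_BYTES = 16 * 1024 * 1024


def approx_size(doc):
    """Rough size proxy: byte length of the compact JSON encoding (not BSON)."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


def make_tweet(i):
    # Hypothetical tweet, filled to the upper bounds given in the outline above.
    return {
        "tweet_id": i,
        "text": "x" * 140,
        "entities": {
            "hashtags": ["tag"] * 10,
            "urls": ["http://example.com/abcdefgh"] * 5,
            "user_mentions": ["someone"] * 12,
        },
        "geo": [[0.0, 0.0]] * 20,
    }


def embedded_user(n_tweets):
    # One user document with every tweet embedded in a single array.
    return {
        "user_id": 123123,
        "somegroupID": 1,
        "tweets": [make_tweet(i) for i in range(n_tweets)],
    }


if __name__ == "__main__":
    for n in (100, 1000, 3000):
        size = approx_size(embedded_user(n))
        print(f"{n} tweets: ~{size} bytes ({size / MAX_BSON_BYTES:.1%} of the limit)")
```

With these assumed sizes, 3000 embedded tweets stay under the limit, but real tweet payloads carry much more metadata and the array grows for as long as the user keeps tweeting, so the headroom depends heavily on what is actually stored per tweet.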


Solution

  • All credit to the fine folks at MongoHQ.com. My question was answered over on https://groups.google.com/d/msg/mongodb-user/OtEOD5Kt4sI/qQg68aJH4VIJ

    Chris Winslett @ MongoHQ


    You will find this video interesting:

    http://www.10gen.com/presentations/mongosv-2011/schema-design-at-scale

    Essentially, store one day's worth of tweets for one person in a single document. The reasoning:

    • Queries typically filter by user and by day

    Therefore, you can have the following index:

    {user_id: 1, date: 1} # Date needs to be last because you will range and sort on the date

    Have fun!

    Chris MongoHQ
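
The one-document-per-user-per-day idea can be sketched roughly as follows. This is only an illustration under assumptions: the `tweets` array name, the integer `YYYYMMDD` encoding of `date`, and the helper names are not from the answer, and `collection` is expected to be a PyMongo-style collection object.

```python
from datetime import datetime, timezone

# Index spec from the answer: equality on user_id first, then range/sort on date.
BUCKET_INDEX = [("user_id", 1), ("date", 1)]


def day_key(ts):
    """Encode a datetime as an integer YYYYMMDD (an assumed encoding)."""
    return int(ts.strftime("%Y%m%d"))


def bucket_update(user_id, ts, tweet):
    """Build the filter and update that append a tweet to its daily bucket."""
    filt = {"user_id": user_id, "date": day_key(ts)}
    update = {"$push": {"tweets": tweet}}  # "tweets" is an assumed field name
    return filt, update


def day_range_query(user_id, first_day, last_day):
    """Filter for one user's buckets between two YYYYMMDD keys, inclusive."""
    return {"user_id": user_id, "date": {"$gte": first_day, "$lte": last_day}}


def store_tweet(collection, user_id, ts, tweet):
    """Upsert so the first tweet of a day creates that day's bucket."""
    filt, update = bucket_update(user_id, ts, tweet)
    return collection.update_one(filt, update, upsert=True)


if __name__ == "__main__":
    ts = datetime(2012, 2, 20, 15, 30, tzinfo=timezone.utc)
    print(bucket_update(123123, ts, {"text": "MongoDB is pretty sweet"}))
    print(day_range_query(123123, 20120201, 20120229))
    # With PyMongo this would be used along the lines of:
    #   coll.create_index(BUCKET_INDEX)
    #   store_tweet(coll, 123123, ts, {...})
    #   coll.find(day_range_query(123123, 20120201, 20120229)).sort("date", 1)
```

Each bucket holds at most one day of tweets, which keeps individual documents bounded, and the compound index serves both the equality match on `user_id` and the range or sort on `date`.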


    I think it makes the most sense to implement the following:

    user

    { user_id: 123123,
      screen_name: 'cledwyn',
      misc_bits: {...},
      groups: ['123123_group_tall_people', '123123_group_techies'],
      groups_in: ['123123_group_tall_people']
    }
    

    tweet

    { tweet_id: 98798798798987987987987,
      user_id: 123123,
      tweet_date: 20120220,
      text: 'MongoDB is pretty sweet',
      misc_bits: {...},
      groups_in: ['123123_group_tall_people']
    }
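
With `groups_in` copied onto each tweet, a per-group analysis can select tweets directly without touching the user collection. A minimal sketch, assuming the field names above and using an aggregation pipeline in place of MapReduce (a substitution, not something the thread prescribes); the function names and the `tweet_count` output field are made up for illustration.

```python
from collections import Counter


def tweets_per_user_pipeline(group_id):
    """Aggregation stages that count tweets per user within one group."""
    return [
        # Matching a scalar against an array field selects documents
        # whose array contains that value.
        {"$match": {"groups_in": group_id}},
        {"$group": {"_id": "$user_id", "tweet_count": {"$sum": 1}}},
        {"$sort": {"tweet_count": -1}},
    ]


def tweets_per_user_local(tweets, group_id):
    """Pure-Python equivalent for checking the logic on in-memory documents."""
    counts = Counter(
        t["user_id"] for t in tweets if group_id in t.get("groups_in", [])
    )
    return dict(counts)


if __name__ == "__main__":
    group = "123123_group_tall_people"
    sample = [
        {"user_id": 123123, "tweet_date": 20120220, "groups_in": [group]},
        {"user_id": 123123, "tweet_date": 20120221, "groups_in": [group]},
        {"user_id": 456456, "tweet_date": 20120220, "groups_in": []},
    ]
    print(tweets_per_user_pipeline(group))
    print(tweets_per_user_local(sample, group))
    # With PyMongo, assuming a collection named "tweet":
    #   db.tweet.create_index([("groups_in", 1)])
    #   db.tweet.aggregate(tweets_per_user_pipeline(group))
```

The trade-off in this layout is that a user's group memberships are duplicated on every tweet, so changing membership means updating that user's tweets as well.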