I am designing a database schema for posts and their comments, in the context of a social networking application, and I'm wondering which of these two designs would give better performance.
I store a post's comments in a "Comments" table and posts in a "Posts" table. My schema for the Comments table looks like this:
postId commentId postedBy Date CommentBody
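For concreteness, that table might be created like this in SQLite (a sketch; the column types are assumptions, since only the column names are given above):

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Comments (
        postId      INTEGER NOT NULL,   -- non-unique: one row per comment
        commentId   INTEGER NOT NULL,
        postedBy    TEXT,
        date        TEXT,
        commentBody TEXT
    )
""")

# Fetching a post's comments means filtering on the non-unique postId.
conn.execute(
    "INSERT INTO Comments VALUES (8452, 1, 'alice', '2013-01-01', 'first!')"
)
rows = conn.execute(
    "SELECT commentId FROM Comments WHERE postId = 8452"
).fetchall()
```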
To retrieve the comments of a post, I would have to search for all rows whose postId matches that post's postId. And postId alone cannot be the primary key, since it is not unique within the column (a single post has several comments). So I was thinking of merging postId and commentId into one single commentId (which becomes the primary key), from which the postId can also be recovered. This is how I am thinking:
commentId would be generated as postId*100 + i (where i is the index of the comment on the post).
Thus, to retrieve the comments of a post (say with postId = 8452), I would search for all rows whose commentId (the primary key) lies between 845200 and 845299, instead of searching for all comments with postId = 8452. (Of course this limits the maximum number of comments per post to 100.) But will this lead to any performance gains?
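The encoding described above can be sketched as follows (the helper names are my own, purely for illustration):

```python
COMMENTS_PER_POST = 100  # the scheme caps each post at 100 comments

def make_comment_id(post_id: int, i: int) -> int:
    """Encode the i-th comment of a post into a single merged key."""
    assert 0 <= i < COMMENTS_PER_POST
    return post_id * COMMENTS_PER_POST + i

def comment_id_range(post_id: int) -> tuple[int, int]:
    """Inclusive key range covering every possible comment of a post."""
    lo = post_id * COMMENTS_PER_POST
    return lo, lo + COMMENTS_PER_POST - 1

# For postId = 8452, the range scan covers keys 845200..845299.
```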
Here's what you do. Load up a database with representative data at (for example) twice the size you ever expect it to get.
Then run your queries and test them against both versions of the schema.
Then, and this is the good bit, retest every X weeks with new, up-to-date data to ensure the situation hasn't changed.
That's what being a DBA is all about. Unless your data never changes, database optimisation is not a set-and-forget operation, and the only way to be sure is to test under representative conditions.
Everything else is guesswork. Educated guesswork, don't get me wrong, but I'd rather have a deterministic answer than anyone's guess, especially since the former adapts to change.
My favorite optimisation mantra is "Measure, don't guess!"
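As a concrete starting point, that measurement could be sketched like this with SQLite (assumed table names and data volumes; the point is to time both query shapes against the same representative data, not to trust these exact numbers or transfer them to your production engine):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Variant A: separate non-unique postId column, with an index on it.
    CREATE TABLE CommentsA (postId INTEGER, commentId INTEGER, body TEXT);
    CREATE INDEX idx_post ON CommentsA(postId);
    -- Variant B: merged key (postId*100 + i) as the primary key.
    CREATE TABLE CommentsB (commentId INTEGER PRIMARY KEY, body TEXT);
""")

# Load representative data: 10,000 posts with 10 comments each.
for post_id in range(10_000):
    for i in range(10):
        conn.execute("INSERT INTO CommentsA VALUES (?, ?, ?)", (post_id, i, "x"))
        conn.execute("INSERT INTO CommentsB VALUES (?, ?)", (post_id * 100 + i, "x"))
conn.commit()

def bench(sql, params, n=1000):
    """Time n executions of the query; crude, but enough for a comparison."""
    start = time.perf_counter()
    for _ in range(n):
        conn.execute(sql, params).fetchall()
    return time.perf_counter() - start

t_a = bench("SELECT * FROM CommentsA WHERE postId = ?", (8452,))
t_b = bench("SELECT * FROM CommentsB WHERE commentId BETWEEN ? AND ?",
            (845200, 845299))
print(f"indexed postId lookup: {t_a:.4f}s, merged-key range scan: {t_b:.4f}s")
```

Rerunning this as the data grows and its shape changes is exactly the "retest every X weeks" step: whichever variant wins today may not win at ten times the size.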