Search code examples
databasedatabase-designnormalization

What benefits of design the tags table as SO did, and when it is good to do the same?


I was reading this article: http://sqlserverpedia.com/wiki/Understanding_the_StackOverflow_Database_Schema and the writer wrote something special about the tags and posts tags tables:

Tags - okay, time to blow out of the bullet points for a second. StackOverflow limits you to five tags per question (answers aren't tagged), and all five are stored in this field. For example, for question 305223, the Tags field is "<offtopic><fun><not-programming-related><jon-skeet>". It's up to you to normalize these. Sam Saffron's SoSlow utility automatically creates Tags and PostsTags tables to normalize these. Next, check the contents of the Tag field carefully. StackOverflow allows periods in the tag, like the .NET tag and ASP.NET tag. However, in the database, these are stored as "aspûnet". Just something to be aware of.

What I want to understand are what are the real benefits we can gain from this way of making the Tags table and binding it to Posts?

Update: I am asking about the feature where the tags saved as string in the same Post record, then an external tool loop on Posts and copy tags to Tags and PostsTags tables.


Solution

  • They are doing this to gain benefits of both denormalized table and normalized tables.

    They are saving tags in a column with question therefore they are easy to show when a question is displayed. They do not need to join with Tags or PostTags table. Take all necessary information from 1 table that is Question. Speed.

    Also same approach is better for transaction processing, when adding new question.When normalized approach used, you need to save information to three tables like below.

    Transaction Begin
      Insert Into Question
      Insert or Update Tags
      Insert or Update PostTags
    Transaction Commit.
    

    This will take more time (may need to lock Tags, PostsTags) than simply inserting Question table. Again you gain speed but for inserting. User perceived web site speed increases. A background process which is doing same thing is invisible to user therefore does not affect him.

    On the other hand they are parsing this same column and copying its contents to Tags and PostsTags. What does it bring? First new Tags created periodically, finding this new tags from Question Table would be slow. Second PostsTags will give you normalized view of which questions are related to which tags. A lot of people only browse their interesting tags only. Therefore providing this information is essential. What do you gain again speed.