Tags: jquery, database, web, jquery-select2, database-performance

Correct way to process lists similar to "Stack Overflow question tags"


Given the following tables:

  • Post (Id, Title, Text, etc.)
  • PostTag(PostId, TagId)
  • Tag (Id, Name)

The edit form (a solution found online) displays the tags associated with a post using the jQuery Select2 control, which allows multiple entries. This input field returns a list of tag Ids.

I think the correct way to process these entries is:

  • an array loop with an insert statement using a NOT EXISTS condition,
  • and at the end, one delete statement with NOT EXISTS to delete the tags that no longer exist (see the sketch after this list).
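
A minimal sketch of that diff approach, written as set-based statements rather than a row-by-row loop, assuming SQL Server syntax; @PostId and the table variable @NewTags (holding the tag Ids posted by Select2) are names assumed here, not from the question:

    -- the post being edited and the tag Ids returned by the Select2 control (assumed inputs)
    DECLARE @PostId INT = 42;
    DECLARE @NewTags TABLE (TagId INT PRIMARY KEY);

    -- add only the associations that do not exist yet
    INSERT INTO PostTag (PostId, TagId)
    SELECT @PostId, n.TagId
    FROM @NewTags n
    WHERE NOT EXISTS (SELECT 1 FROM PostTag pt
                      WHERE pt.PostId = @PostId AND pt.TagId = n.TagId);

    -- remove the associations whose tags are no longer in the submitted list
    DELETE pt
    FROM PostTag pt
    WHERE pt.PostId = @PostId
      AND NOT EXISTS (SELECT 1 FROM @NewTags n WHERE n.TagId = pt.TagId);

Only the rows that actually changed are touched, which is what keeps the write volume small.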

Or is it better to delete everything and then insert the new list?
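
For comparison, the delete-everything-and-reinsert variant would look roughly like this (same assumed @PostId and @NewTags as above):

    -- wipe every existing association for the post, then recreate the whole list
    DELETE FROM PostTag WHERE PostId = @PostId;

    INSERT INTO PostTag (PostId, TagId)
    SELECT @PostId, TagId FROM @NewTags;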

And what about when the PostTag table has the fields (Id, PostId, TagId, CreatedOn) and allows repeated entries? I believe either of the two previous approaches would still apply... and all because the Select2 control only returns a list of tag Ids.
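
For reference, the two PostTag shapes being compared could be defined like this; the column types and constraints are my assumption, and the two definitions are alternatives, not meant to coexist:

    -- variant A: composite key, at most one row per (post, tag) pair
    CREATE TABLE PostTag (
        PostId INT NOT NULL REFERENCES Post(Id),
        TagId  INT NOT NULL REFERENCES Tag(Id),
        PRIMARY KEY (PostId, TagId)
    );

    -- variant B: surrogate Id plus CreatedOn, repeated (post, tag) rows allowed
    CREATE TABLE PostTag (
        Id        INT IDENTITY(1,1) PRIMARY KEY,
        PostId    INT NOT NULL REFERENCES Post(Id),
        TagId     INT NOT NULL REFERENCES Tag(Id),
        CreatedOn DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
    );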

My question applies to the same logic used here on Stack Overflow, where tags have to be inserted for each question.

What is the correct logic to use for this type of data?

NOTE: The answer I am looking for needs to take performance and database impact into consideration (the effects of deletes, table locks, and updates). I say that because it is easy to build a solution that works but is terrible in every other respect.


Solution

  • I almost implemented a similar tags application six years ago. However, upper management decided to use SharePoint as the application platform instead.

    1. I agree with you on:

      "I think the correct way to process these entries is:
      an array loop with an insert statement using a NOT EXISTS condition,
      and at the end, one delete statement with NOT EXISTS to delete the tags that no longer exist."

    2. Why am I against the idea of "delete all and then insert the new list"?

    To illustrate my point, let me exaggerate a little. Imagine that I have a post with 100 tags.

    8:00 am: I edit the post by adding one tag and save it to the database. That results in 100 deletes and 101 inserts. These operations generate DML activity, transaction log records, and page and index maintenance for both the delete and the insert operations, etc. If you use an auto-increment Id field in the PostTag table, there will be a gap in the Id sequence because of the 100 deletes.

    8:05 am: I edit the post by deleting one tag and save it to the database. That results in 101 deletes and 100 inserts. The side effects are the same as above.

    8:10 am: I edit the post by changing nothing and simply save it to the database. That still results in 100 deletes and 100 inserts. The side effects are the same as above.

    You can see that this approach produces a lot of side effects even though there is only one real tag addition and one real tag deletion. This will have a serious performance impact on the database. Since you mention Stack Overflow: imagine Stack Overflow using this approach while 10,000 users edit 10,000 posts with 10 tags each...

    3. "What about when the PostTag table has the fields (Id, PostId, TagId, CreatedOn) and allows repeated entries?"

    I would use those fields to implement a 'the more entries, the bigger the tag' feature, which many websites offer: the more often users click on a tag, the bigger that tag is displayed.
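
    A sketch of how those repeated rows could drive the tag size, simply by counting how many times each tag has been attached; this is a hypothetical query, assuming the (Id, PostId, TagId, CreatedOn) version of PostTag:

        -- the count per tag drives how big the tag is rendered in the tag cloud
        SELECT t.Name, COUNT(*) AS TimesUsed
        FROM PostTag pt
        JOIN Tag t ON t.Id = pt.TagId
        GROUP BY t.Name
        ORDER BY TimesUsed DESC;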

    The CreatedOn field can later be used for business intelligence analysis, such as determining when tags were associated with posts, etc.
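
    For example, CreatedOn makes it possible to ask how often each tag was applied per month (again only an assumed sketch):

        -- monthly tag usage based on when each PostTag row was created
        SELECT t.Name,
               YEAR(pt.CreatedOn)  AS UsageYear,
               MONTH(pt.CreatedOn) AS UsageMonth,
               COUNT(*)            AS TimesUsed
        FROM PostTag pt
        JOIN Tag t ON t.Id = pt.TagId
        GROUP BY t.Name, YEAR(pt.CreatedOn), MONTH(pt.CreatedOn)
        ORDER BY UsageYear, UsageMonth, TimesUsed DESC;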