Tags: mysql, sql, performance, join

How costly are JOINs in SQL? And/or, what's the trade off between performance and normalization?


I've found a similar thread but it doesn't really capture the essence of what I'm trying to ask - so I've created a new thread.

I know there is a trade-off between normalization and performance, and I'm wondering what's the best practice for drawing that line? In my particular situation, I have a messaging system that has three distinct tables: messages_threads (overarching message holder), messages_recipients (who is involved), and messages_messages (the actual messages + timestamps).

In order to return the "inbox" view, I have to left join the messages_threads, users, and pictures tables to the messages_recipients table in order to get the information to populate the view (profile picture, sender name, thread id)... and I've still got to add a join to messages_messages to retrieve the text of the last message, so I can display a "preview" of it to the user.
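For concreteness, here is a sketch of what that inbox query might look like. It uses SQLite via Python so it is self-contained and runnable; the column names (sender_id, sent_at, url, etc.) are guesses based on the description above, not the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE pictures (user_id INTEGER, url TEXT);
CREATE TABLE messages_threads (id INTEGER PRIMARY KEY, subject TEXT);
CREATE TABLE messages_recipients (thread_id INTEGER, user_id INTEGER);
CREATE TABLE messages_messages (
    id INTEGER PRIMARY KEY, thread_id INTEGER,
    sender_id INTEGER, body TEXT, sent_at TEXT);

INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO pictures VALUES (1, 'alice.jpg'), (2, 'bob.jpg');
INSERT INTO messages_threads VALUES (10, 'Hello');
INSERT INTO messages_recipients VALUES (10, 1), (10, 2);
INSERT INTO messages_messages VALUES
    (100, 10, 1, 'Hi Bob!', '2024-01-01'),
    (101, 10, 2, 'Hi Alice!', '2024-01-02');
""")

# Inbox for user 2: each thread they are in, the other participant's
# name and picture, and the body of the thread's most recent message.
# (Assumes two-person threads; group threads would need more care.)
rows = conn.execute("""
SELECT t.id, u.name, p.url, m.body AS preview
FROM messages_recipients r
JOIN messages_threads t ON t.id = r.thread_id
JOIN messages_recipients other
     ON other.thread_id = t.id AND other.user_id <> r.user_id
LEFT JOIN users u ON u.id = other.user_id
LEFT JOIN pictures p ON p.user_id = u.id
LEFT JOIN messages_messages m
     ON m.id = (SELECT id FROM messages_messages
                WHERE thread_id = t.id
                ORDER BY sent_at DESC LIMIT 1)
WHERE r.user_id = 2
""").fetchall()

print(rows)   # [(10, 'Alice', 'alice.jpg', 'Hi Alice!')]
```

Four joins, but each one is a lookup by key; that is the normal shape of this kind of query, not a sign of a design problem.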

My question is: how costly are JOINs to performance in SQL? I could, for instance, store the sender's name (which I currently have to left join from users to retrieve) in a field on the messages_threads table called "sendername" - but in terms of normalization, I've always been taught to avoid data redundancy.

Where do you draw the line? Or am I overestimating how performance-hampering SQL joins are?


Solution

  • The best practice is to always start with 3NF, and then only consider denormalisation if you find a specific performance problem.

    Performance is just one of the issues you have to deal with in a database. By duplicating data, you run the risk of allowing inconsistent data into your database, thus nullifying one of the core principles of relational databases, consistency (the C in ACID).[a]

    Yes, joins have a cost, there's no getting around that. However, the cost is usually a lot less than you'd think, and can often be swamped by other factors like network transmission times. By making sure the relevant columns are indexed properly, you can avoid a lot of those costs.

    And, remember the optimisation mantra: measure, don't guess! And measure in a production-like environment. And keep measuring (and tuning) periodically - optimisation is a set-and-forget operation only if your schema and data never change (very unlikely).


    [a] Denormalisation for performance can usually be made safe by using triggers to maintain consistency. This will, of course, slow down your updates, but may still let your selects run faster.
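A minimal sketch of that trigger idea, again using SQLite via Python for a runnable illustration (the denormalised sender_name column and the trigger name are hypothetical, chosen to match the question's scenario):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE messages_threads (
    id INTEGER PRIMARY KEY,
    sender_id INTEGER,
    sender_name TEXT);   -- denormalised copy of users.name

-- Keep the copy consistent whenever a user is renamed.
CREATE TRIGGER sync_sender_name
AFTER UPDATE OF name ON users
BEGIN
    UPDATE messages_threads
    SET sender_name = NEW.name
    WHERE sender_id = NEW.id;
END;

INSERT INTO users VALUES (1, 'Alice');
INSERT INTO messages_threads VALUES (10, 1, 'Alice');
""")

conn.execute("UPDATE users SET name = 'Alicia' WHERE id = 1")
name = conn.execute(
    "SELECT sender_name FROM messages_threads WHERE id = 10").fetchone()[0]
print(name)   # Alicia
```

The select side now avoids a join, at the price of extra work on every rename - exactly the trade described in the footnote.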
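The "index the relevant columns" and "measure, don't guess" advice can be checked directly. SQLite's EXPLAIN QUERY PLAN (MySQL's rough equivalent is EXPLAIN) shows whether a query scans the whole table or uses an index; the table and index names below are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages_recipients (thread_id INTEGER, user_id INTEGER)")

query = "SELECT thread_id FROM messages_recipients WHERE user_id = ?"

# Without an index, the planner has no choice but a full table scan.
before = conn.execute("EXPLAIN QUERY PLAN " + query, (2,)).fetchall()
print(before[0][3])   # e.g. "SCAN messages_recipients"

# Index the join/filter column, then check the plan again.
conn.execute(
    "CREATE INDEX idx_recipients_user ON messages_recipients (user_id)")
after = conn.execute("EXPLAIN QUERY PLAN " + query, (2,)).fetchall()
print(after[0][3])    # e.g. "SEARCH ... USING INDEX idx_recipients_user"
```

The same habit - inspect the plan before and after an index change, on realistic data volumes - is what turns "joins are slow" from a guess into a measurement.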