Search code examples
sqlgoogle-bigquerydenormalization

A practical example of denormalization in a SQL database?


I've been reading about denormalization for the last 20 minutes but can't get a concise example with code.

Is this what denormalization is?


1. We have a normalized database:

Table_1:
customer_id (Primary key)
country
city
street
house_number

Table_2:
product_id (Primary Key)
customer_id (Foreign key)
product_storage_building

Table_3:
product_id (Foreign Key)
product_name
product_color
product_origin

  1. However, joining all three tables is taking far too long to run let's say

        SELECT a.*, b.*, c.*
        FROM 
        TABLE_1 AS a
        LEFT JOIN TABLE_2 AS b
        ON a.customer_id = b.customer_id
        LEFT JOIN TABLE_3 AS c
        ON b.product_id = c.product_id
    

So I create a new table out of Table_1 and Table_2

    CREATE OR REPLACE TABLE Denormalized_Data AS
    (
     SELECT customer_id, 
            country, 
            city,
            street, 
            house_number,
            product_id,
            product_storage_building
     FROM Table_1
          LEFT JOIN Table_2
          ON Table_1.cusomter_id = Table_2.customer_id
    )
  1. Then join to Table_3 as follows

     SELECT customer_id, 
            country, 
            city,
            street, 
            house_number,
            product_storage_building,
            Denormalized_Data.product_id
            product_name,
            product_color,
         FROM Denormalized_Data
         LEFT JOIN Table_3
         ON Denormalized_Data.product_id = Table_3.product_id
    

Now this will make my query run faster - can the whole process above be described as denormalization?

Thanks


Solution

  • Yes, you're showing one type of denormalization.

    There are three types of denormalization:

    • Join rows from different tables, so you don't have to use queries with JOIN.
    • Perform aggregate calculations like SUM() or COUNT() or MAX() or others, so you don't have to use queries with GROUP BY.
    • Pre-calculate expensive calculations, so you don't have to use queries with complex expressions in the select-list.

    You're showing an example of the first type. At least you can avoid one of the two joins you intend to do.

    Why not make the denormalized table store the result of joining all three tables?

    What's the downside of using denormalization? You are now storing data redundantly: once in the normalized tables, and a copy in the denormalized table. Suppose you get into work tomorrow and find that the data in these different tables doesn't exactly match up. What happened?

    • Maybe someone inserted a row to the normalized tables without adding the corresponding data to the denormalized table.
    • Maybe someone deleted a row from the normalied tables, without deleting the corresponding row from the denormalized table.
    • Maybe someone inserted or deleted a row in the denormalized table, without the corresponding change in the normalized table.

    How can you tell what happened? Which table is "correct"? This is the risk of denormalization.