Search code examples
databasedatabase-designdatabase-normalization

Is normalization always necessary and more efficient?


I got into databases and normalization. I am still trying to understand normalization and I am confused about its usage. I'll try to explain it with this example. Every day I collect data which would look like this in a single table:

TABLE: CAR_ALL

ID DATE CAR LOCATION FUEL FUEL_USAGE MILES BATTERY
123 01.01.2021 Toyota New York 40.3 3.6 79321 78
520 01.01.2021 BMW Frankfurt 34.2 4.3 123232 30
934 01.01.2021 Mercedes London 12.7 4.7 4321 89
123 05.01.2021 Toyota New York 34.5 3.3 79515 77
520 05.01.2021 BMW Frankfurt 20.1 4.6 123489 29
934 05.01.2021 Mercedes London 43.7 5.0 4400 89

In this example I get data for thousands of cars every day. ID, CAR and LOCATION never changes. All the other data can have other values daily. If I understood correctly, normalizing would make it look like this:

TABLE: CAR_CONSTANT

ID CAR LOCATION
123 Toyota New York
520 BMW Frankfurt
934 Mercedes London

TABLE: CAR_MEASUREMENT

GUID ID DATE FUEL FUEL_USAGE MILES BATTERY
1 123 01.01.2021 40.3 3.6 79321 78
2 520 01.01.2021 34.2 4.3 123232 30
3 934 01.01.2021 12.7 4.7 4321 89
4 123 05.01.2021 34.5 3.3 79515 77
5 520 05.01.2021 20.1 4.6 123489 29
6 934 05.01.2021 43.7 5.0 4400 89

I have two questions:

  1. Does it make sense to create an extra table for DATE?

  2. It is possible that new cars will be included through the collected data. For every row I insert into CAR_MEASUREMENT, I would have to check whether the ID is already in CAR_CONSTANT. If it doesn't exist, I'd have to insert it. But that means that I would have to check through CAR_CONSTANT thousands of times every day. Wouldn't it be more efficient if I just insert the whole data as 1 row into CAR_ALL? I wouldn't have to check through CAR_CONSTANT every time.


Solution

  • The benefits of normalization are dependent on your specific use case. I can see both pros and cons to normalizing your schema, but its impossible to say which is better without more knowledge of your use case.

    Pros:

    • With your schema, normalization could reduce the amount of data consumed by your DB since CAR_MEASUREMENT will probably be much larger than CAR_CONSTANT. This scales up if you are able to factor out additional data into CAR_CONSTANT.
    • Normalization could also improve data consistency if you ever begin tracking additional fixed data about a car, such as license plate number. You could simply update one row in CAR_CONSTANT instead of potentially thousands of rows in CAR_ALL.
    • A normalized data structure can make it easier to query data for a specific car. using a LEFT JOIN, the DBMS can search through the CAR_MEASUREMENT table based on the integer ID column instead of having to compare two string columns.

    Cons:

    • As you noted, the normalized form requires an additional lookup and possible insert to CAR_CONSTANT for every addition to CAR_MEASUREMENT. Depending on how fast you are collecting this data, those extra queries could be too much overhead.

    To answer your questions directly:

    1. I would not create an extra table for just the date. The date is a part of the CAR_MEASUREMENT data and should not be separated. The only exception that I can think of to this is if you will eventually collect measurements that do not contain any car data. In that case, then it would make sense to split CAR_MEASUREMENT into separate MEASUREMENT and CAR_DATA tables with MEASUREMENT containing the date, and CAR_DATA containing just the car-specific data.
    2. See above. If you have a use case to query data for a specific car, then the normalized form can be more efficient. If not, then the additional INSERT overhead may not be worth it.