Search code examples
database-designrelational-databasedatabase-normalization

Is it really better to respect the normal forms and create an additional table for one simple field?


We organize runs, and we want to save run results and the number of kilometers that runners did run in their career. We have one table with every runners data (name, age, gender, etc. including an Id).

I see two ways of keeping the other data:

  • Having two more tables. First one with the run length and the run Id, and the second one with 3 columns: runner Id, run Id and the runner's position. If I am right this would respect the second normal form, with no duplication.
  • Having only one table which has the Runner Id, the run length and the runner position. The run length would there be duplicated for every runner (let's say about 20 each time).

Having two tables will bring additional benefits, like retrieving the order of a specific run or store additional data in the runs table, but we will only need to retrieve the positions of a runner and its total traveled distance. In this case, having only one table looks way more optimized than having two tables, since the duplication is for one field (an integer) and avoids storing a whole table and referencing its Id anyway.

We have only two integers to store, one for the run length and one for the runner's position. Thus, there's no heavy duplicate data.

Would the result be lighter by breaking the second normal form?


Solution

  • In my opinion, three tables data model it is (Oracle syntax, but disregard that; principle is just the same elsewhere):

    SQL> create table runner
      2    (runner_id number primary key,
      3     name      varchar2(20)
      4    );
    
    Table created.
    
    SQL> create table run
      2    (run_id    number primary key,
      3     name      varchar2(20),
      4     length    number
      5    );
    
    Table created.
    
    SQL> create table rxr
      2    (runner_id number references runner,
      3     run_id    number references run,
      4     position  number,
      5     constraint uk_rxr unique (runner_id, run_id)
      6    );
    
    Table created.
    
    SQL>
    

    If it were only runner and only one (additional) table, that would be

    SQL> create table oot
      2    (runner_id number references runner,
      3     length    number,
      4     position  number
      5    );
    
    Table created.
    

    which is an option, but worse than the previous one. You can calculate length - that's true - but you have no idea where (on which runs) that took place, so any additional statistics info would be impossible to collect. For example, which track is my favourite track? (right now, there was Wengen man slalom (alpine race) which would add length and position to my data, but - without actually knowing the venue - I can't compare runners between tracks ("I'm the second best runner in Wengen").


    On the other hand, are you SURE that run (the 2nd table in the 1st piece of code) data won't change through time? For example, in Formula 1 car races: Imola has its ID and track length is well known. But then, they decide to make two corners out of one and length is changed, so even the 1st model I posted would be wrong (kind of). How about moving such data into the 3rd table?

    SQL> drop table rxr;
    
    Table dropped.
    
    SQL> drop table run;
    
    Table dropped.
    
    SQL> create table run
      2    (run_id    number primary key,
      3     name      varchar2(20)
      4    );
    
    Table created.
    
    SQL> create table rxr
      2    (runner_id number references runner,
      3     run_id    number references run,
      4     run_date  date,
      5     length    number,
      6     position  number,
      7     constraint uk_rxr unique (runner_id, run_id)
      8    );
    
    Table created.
    
    SQL>
    

    Basically, think twice! Create data model which will suite your needs in long turn, don't take shortcuts as they might be counter-productive at the end.