Search code examples
mysqlsqlduplicatesmysql-workbenchdatabase-normalization

SQL about duplicate data


I have a general question about data normalisation (my first steps into SQL). I was given a CSV file which I have uploaded to MySQL workbench and was asked to do data normalisation (use normal forms). I have found some obvious duplicates, but I am unsure about one thing. There is data about companies and their addresses, but, for example in "Country" field, UNITED STATES is repeated few thousand times and I was unsure does that count as duplicate data? Should I make separate table, for instance, "Countries" and give them unique ID, let's say 1 for US and then update data in the original table where there is US replace it with 1?


Solution

  • If you want to eliminate duplicates, then 'UNITED STATES' is, indeed, a duplicate.

    So, a countries reference table would be called for, if your goal is normalization.

    Note that in this case, normalization will probably reduce the size of the data. Your key for the countries table will probably be an integer (4 bytes) -- although it could also be ISO 2-character or 3-character country codes. Repeating the primary key in the original table should reduce the overall size of the data in the database.