Search code examples
mysqljoinnormalization

Normalization makes joins accross multiple tables difficult


I had a table for stores containing store name and address. After some discussion, we are now normalizing the the table, putting address in separate tables. This is done for two reasons:

  1. Increase search speed for stores by location / address
  2. Increase execution time for checking misspelled street names using the Levenshtein algorithm when importing stores.

The new structure looks like this (ignore typos):

country;
+--------------------+--------------+------+-----+---------+-------+  
| Field              | Type         | Null | Key | Default | Extra |  
+--------------------+--------------+------+-----+---------+-------+  
| id                 | varchar(2)   | NO   | PRI | NULL    |       |  
| name               | varchar(45)  | NO   |     | NULL    |       |  
| prefix             | varchar(5)   | NO   |     | NULL    |       |  
+--------------------+--------------+------+-----+---------+-------+  

city;
+--------------------+--------------+------+-----+---------+-------+  
| Field              | Type         | Null | Key | Default | Extra |  
+--------------------+--------------+------+-----+---------+-------+  
| id                 | int(11)      | NO   | PRI | NULL    |       |  
| city               | varchar(50)  | NO   |     | NULL    |       |  
+--------------------+--------------+------+-----+---------+-------+  

street;
+--------------------+--------------+------+-----+---------+-------+  
| Field              | Type         | Null | Key | Default | Extra |  
+--------------------+--------------+------+-----+---------+-------+  
| id                 | int(11)      | NO   | PRI | NULL    |       |  
| street             | varchar(50)  | YES  |     | NULL    |       |  
| fk_cityID          | int(11)      | NO   |     | NULL    |       |  
+--------------------+--------------+------+-----+---------+-------+  

address;
+--------------------+--------------+------+-----+---------+-------+  
| Field              | Type         | Null | Key | Default | Extra |  
+--------------------+--------------+------+-----+---------+-------+  
| id                 | int(11)      | NO   | PRI | NULL    |       |  
| streetNum          | varchar(10)  | NO   |     | NULL    |       |  
| street2            | varchar(50)  | NO   |     | NULL    |       |  
| zipcode            | varchar(10)  | NO   |     | NULL    |       |  
| fk_streetID        | int(11)      | NO   |     | NULL    |       |  
| fk_countryID       | int(11)      | NO   |     | NULL    |       |  
+--------------------+--------------+------+-----+---------+-------+  
*street2 is for secondary reference or secondary address in e.g. the US.

store;
+--------------------+--------------+------+-----+---------+-------+  
| Field              | Type         | Null | Key | Default | Extra |  
+--------------------+--------------+------+-----+---------+-------+  
| id                 | int(11)      | NO   | PRI | NULL    |       |  
| name               | varchar(50)  | YES  |     | NULL    |       |
| street             | varchar(50)  | YES  |     | NULL    |       |    
| fk_addressID       | int(11)      | NO   |     | NULL    |       |  
+--------------------+--------------+------+-----+---------+-------+  
*I've left out address columns in this table to shorten code

The new tables have been populated with correct data and the only thing remaining is to add foreign key address.id in store table.

The following code lists all street names correctly:

select a.id, b.street, a.street2, a.zipcode, c.city, a.fk_countryID
from address a
left join street b on a.fk_streetID = b.id
left join city c on b.fk_cityID = c.id
  1. How can I update fk_addressID in store table?
  2. How can I list all stores with correct address?
  3. Is this bad normalization considering the reasons given above?

UPDATE

It seems like the following code lists all stores with correct address - however it is a bit slow (I have about 2000 stores):

select a.id, a.name, b.id, c.street
from sl_store a, sl_address b, sl_street c
where b.fk_streetID = c.id
and a.street1 = c.street
group by a.name
order by a.id

Solution

  • I'm not going to speak to misspellings. Since you're importing the data, misspellings are better handled in a staging table.

    Let's look at this slightly simplified version.

    create table stores
    (
      store_name varchar(50) primary key,
      street_num varchar(10) not null,
      street_name varchar(50) not null,
      city varchar(50) not null,
      state_code char(2) not null,
      zip_code char(5) not null,
      iso_country_code char(2) not null,
      -- Depending on what kind of store you're talking about, you *could* have
      -- two of them at the same address. If so, drop this constraint.
      unique (street_num, street_name, city, state_code, zip_code, iso_country_code)
    );  
    
    insert into stores values 
    ('Dairy Queen #212',  '232', 'N 1st St SE',   'Castroville',  'CA', '95012', 'US'),
    ('Dairy Queen #213',  '177', 'Broadway Ave',  'Hartsdale',    'NY', '10530', 'US'),
    ('Dairy Queen #214', '7640', 'Vermillion St', 'Seneca Falls', 'NY', '13148', 'US'),
    ('Dairy Queen #215', '1014', 'Handy Rd',      'Olive Hill',   'KY', '41164', 'US'),
    ('Dairy Mart #101',   '145', 'N 1st St SE',   'Castroville',  'CA', '95012', 'US'),
    ('Dairy Mart #121',  '1042', 'Handy Rd',      'Olive Hill',   'KY', '41164', 'US');
    

    Although a lot of people firmly believe that ZIP code determines city and state in the US, that's not the case. ZIP codes have to do with how carriers drive their routes, not with geography. Some cities straddle the borders between states; single ZIP code routes can cross state lines. Even Wikipedia knows this, although their examples might be out of date. (Delivery routes change constantly.)

    So we have a table that has two candidate keys,

    • {store_name}, and
    • {street_num, street_name, city, state_code, zip_code, iso_country_code}

    It has no non-key attributes. I think this table is in 5NF. What do you think?

    If I wanted to increase the data integrity for street names, I might start with something like this.

    create table street_names
    (
      street_name varchar(50) not null,
      city varchar(50) not null,
      state_code char(2) not null,
      iso_country_code char(2) not null,
      primary key (street_name, city, state_code, iso_country_code)
    );  
    
    insert into street_names
    select distinct street_name, city, state_code, iso_country_code
    from stores;
    
    alter table stores
    add constraint streets_from_street_names
    foreign key             (street_name, city, state_code, iso_country_code)
    references street_names (street_name, city, state_code, iso_country_code);
    -- I don't cascade updates or deletes, because in my experience
    -- with addresses, that's almost never the right thing to do when a 
    -- street name changes.
    

    You could (and probably should) repeat this process for city names, state names (state codes), and country names.

    Some problems with your approach

    You can apparently enter a street id number for a street that's in the US, along with the country id for Croatia. (The "full name" of a city, so to speak, is the kind of fact you probably want to store in order to increase data integrity. That's probably also true of the "full name" of a street.)

    Using id numbers for every bit of data greatly increases the number of joins required. Using id numbers doesn't have anything to do with normalization. Using id numbers without corresponding unique constraints on the natural keys--an utterly commonplace mistake--allows duplicate data.