mysql normalization

How can this report (table) be normalized?


Quick backstory: I have some training in database design and administration, and I have used databases on a small scale in various positions during my career. I have a recent bachelor's degree in CMIS, and databases were a part of it. I understand the theory behind relational databases, how they work, and ultimately how to build them. I just have not had enough practice to be proficient.

My boss decided today that he wants me to take all the reports we get from our partner company and put them into a database. To me, this seems like a daunting task because there are no fewer than 30 reports and many of them have LOTS and LOTS of data. We get them in Excel format.

I understand normalization, but only in theory and only on a small scale (like the typical student, instructor, class type of scenario that seems to be ever-present). I am looking at a much larger scale here and am a bit dizzy from it all.

Below is a link to one of the smaller reports. It is already in 1NF (and it comes that way, so no big issues there). I would like to see an example of what this would look like normalized to 3NF; that might help spark something for the rest of the reports.

Now, where I get confused is that none of these reports actually depend on the others. However, there is a lot of repeating data spread out among them: all of them have tech numbers, tech names, and work order numbers. The tech numbers and names are a finite set that repeats across reports, but a given work order number might appear in more than one report or might not, if that makes sense.

I understand that it would make sense to have a table with just tech information, and then relate the reports to it using only the tech number, taking the tech name out of the reports. I have so many more questions, but will leave it at this for now.

And btw, before anyone says "it's stupid to put data on the web like that", this has been modified so that it is not correct data, and is essentially useless.

https://docs.google.com/spreadsheet/ccc?key=0ApvRcXXd6PiWdHFLRWVmNS1VUklpYkFvWVdKQmpvdWc


Solution

  • Normalization through BCNF is based on keys and functional dependencies. There really isn't enough information in the data you posted to normalize to 3NF.

    For example, there's only one value for region, rsp, and office. So, in your sample data, all the other columns will determine one and only one value for region, for rsp, and for office.

    tech_code->region
    tech_name->region
    dish_week_end_date->region
    last_change_date->region
    ...
    tech_code->rsp
    tech_name->rsp
    dish_week_end_date->rsp
    last_change_date->rsp
    ...
    tech_code->office
    tech_name->office
    dish_week_end_date->office
    last_change_date->office
    

    Now, even though last_change_date determines one and only one value for office, is that a real functional dependency? No, it's probably just a coincidence.
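    To see why a dependency that "holds" in a small sample proves little, here is a minimal Python sketch. The helper `fd_holds` and all the row values are invented for illustration; none of them come from the spreadsheet.

```python
from collections import defaultdict

def fd_holds(rows, lhs, rhs):
    """Return True if, in these rows, every lhs value maps to exactly one rhs value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return all(len(values) == 1 for values in seen.values())

# Made-up sample rows: like the posted data, only one office appears.
rows = [
    {"tech_code": "T1", "tech_name": "Ann", "last_change_date": "2012-01-03", "office": "North"},
    {"tech_code": "T2", "tech_name": "Bob", "last_change_date": "2012-01-04", "office": "North"},
    {"tech_code": "T1", "tech_name": "Ann", "last_change_date": "2012-01-04", "office": "North"},
]

# Holds, but only because every row has the same office.
print(fd_holds(rows, "last_change_date", "office"))  # True

# One row from a second office on an existing date breaks the apparent dependency.
rows.append({"tech_code": "T3", "tech_name": "Cy", "last_change_date": "2012-01-04", "office": "South"})
print(fd_holds(rows, "last_change_date", "office"))  # False
```

    A check like this can only refute a dependency. Whether one really holds is a question about the business rules, not about the sample.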

    Having said that, I'll make some guesses. Since I can't copy data from that spreadsheet, I'm going to assume some things I wouldn't normally assume, just to get you moving in the right direction.

    "Work order number" isn't a key; two rows have 23464504300055024. Since I can't copy the data, I'm not going to try to work out whether you have a key, and what it might be.
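    If you export the report to rows, a rough way to test whether a column (or a combination of columns) could be a key is to look for repeated values. This is only a sketch: the helper `duplicates` is hypothetical, and apart from the repeated work order number mentioned above, the values are made up.

```python
from collections import Counter

def duplicates(rows, columns):
    """Return the value combinations for `columns` that appear in more than one row."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [combo for combo, n in counts.items() if n > 1]

# Invented rows; the tech codes and the third work order number are placeholders.
rows = [
    {"work_order_number": "23464504300055024", "tech_code": "T1"},
    {"work_order_number": "23464504300055024", "tech_code": "T2"},
    {"work_order_number": "10000000000000001", "tech_code": "T1"},
]

# A repeated value means the column alone cannot be a key.
print(duplicates(rows, ["work_order_number"]))               # [('23464504300055024',)]

# No repeats here, so this pair is still a candidate in this sample.
print(duplicates(rows, ["work_order_number", "tech_code"]))  # []
```

    An empty result only means the combination is unique in the data you have; it doesn't prove it's a key.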

    Guesses at functional dependencies

    office -> region
    office -> rsp
    tech_code -> tech_name
    tech_name -> tech_code
    last_change_date -> dish_week_end_date
    work_order_number -> work_order_type
    work_order_number -> account_number
    work_order_number -> car
    

    The counts will probably depend only on the key, if there is one.
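    If those guesses turned out to be right, the decomposition might look roughly like the sketch below. It uses Python's built-in `sqlite3` module so it can run anywhere; MySQL DDL would differ slightly. The table names, the column types, the `report_line` table, and the `some_count` column are all assumptions, and `report_line` deliberately declares no key because the real key is still unknown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign key checks off by default

conn.executescript("""
-- office -> region, office -> rsp
CREATE TABLE office (
    office TEXT PRIMARY KEY,
    region TEXT,
    rsp    TEXT
);
-- tech_code -> tech_name, tech_name -> tech_code (hence UNIQUE)
CREATE TABLE tech (
    tech_code TEXT PRIMARY KEY,
    tech_name TEXT NOT NULL UNIQUE
);
-- last_change_date -> dish_week_end_date
CREATE TABLE change_date (
    last_change_date   TEXT PRIMARY KEY,
    dish_week_end_date TEXT
);
-- work_order_number -> work_order_type, account_number, car
CREATE TABLE work_order (
    work_order_number TEXT PRIMARY KEY,
    work_order_type   TEXT,
    account_number    TEXT,
    car               TEXT
);
-- What is left of the report: references plus the counts (placeholder column).
CREATE TABLE report_line (
    work_order_number TEXT NOT NULL REFERENCES work_order(work_order_number),
    tech_code         TEXT NOT NULL REFERENCES tech(tech_code),
    office            TEXT NOT NULL REFERENCES office(office),
    last_change_date  TEXT NOT NULL REFERENCES change_date(last_change_date),
    some_count        INTEGER
);
""")

# Made-up data: each fact is stored once and the report row only points at it.
conn.execute("INSERT INTO office VALUES ('North', 'R1', 'RSP1')")
conn.execute("INSERT INTO tech VALUES ('T1', 'Ann')")
conn.execute("INSERT INTO change_date VALUES ('2012-01-03', '2012-01-07')")
conn.execute("INSERT INTO work_order VALUES ('W1', 'install', 'A1', NULL)")
conn.execute("INSERT INTO report_line VALUES ('W1', 'T1', 'North', '2012-01-03', 2)")

# A join puts the original wide row back together.
wide = conn.execute("""
    SELECT r.work_order_number, t.tech_name, o.region, r.some_count
    FROM report_line r
    JOIN tech t   ON t.tech_code = r.tech_code
    JOIN office o ON o.office = r.office
""").fetchall()
print(wide)
```

    The other reports could then reference the same `tech` table by `tech_code` instead of repeating the tech name.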

    Is that enough to get you going?