
How to do database normalization with long string?


I need some advice on how to normalize my current database table.

Currently the schema looks like:

+----------+---------------+------+-----+---------+-------+
| Field    | Type          | Null | Key | Default | Extra |
+----------+---------------+------+-----+---------+-------+
| date     | date          | YES  | MUL | NULL    |       | 
| hostname | varchar(255)  | YES  |     | NULL    |       | 
| username | varchar(255)  | YES  | MUL | NULL    |       | 
| path     | varchar(1024) | YES  | MUL | NULL    |       | 
| count    | int(11)       | YES  |     | NULL    |       | 
+----------+---------------+------+-----+---------+-------+

path is a long string, and the same value often repeats many times. I'd like to move it into a separate table and reference it with a foreign key from the current table. That would save a lot of disk space, as I currently add about 30M records per week.

The problem is that when inserting data into the new table, called Path, I somehow need to know whether the path already exists. If it does, I should get the existing row and use its id as the FK in the current table; otherwise I should create a new row in Path. I thought about making the path column in Path unique, but MySQL doesn't allow that and raises the error "Specified key was too long; max key length is 1000 bytes".
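One way to picture the split is below. The table and column names (Path, access_log, path_id) are illustrative, not from the question; path is shortened to 1000 bytes so the unique key fits under the stated limit:

```sql
-- A sketch of the proposed split. Names are assumptions; note that the
-- 1000-byte key limit counts bytes, so a multi-byte charset shrinks the
-- number of characters that fit.
CREATE TABLE Path (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    path VARCHAR(1000) NOT NULL,
    UNIQUE KEY uq_path (path)
);

CREATE TABLE access_log (
    date     DATE,
    hostname VARCHAR(255),
    username VARCHAR(255),
    path_id  INT UNSIGNED NOT NULL,
    count    INT,
    FOREIGN KEY (path_id) REFERENCES Path (id)
);
```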

My question is: what is the best way to do the following:

  1. Check if the path value exists in another table or not
  2. If so, get the id of that as FK and create an entry in current table
  3. If not, create a new path entry, get PK as FK and create an entry in current table
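The three steps above map to SQL roughly like this (a sketch; the `?` placeholders stand for bound parameters, and the Path/access_log names are assumed):

```sql
-- Step 1: check whether the path already exists.
SELECT id FROM Path WHERE path = ?;

-- Step 3 (only if the SELECT returned no row): create the path entry.
-- LAST_INSERT_ID() then holds the new primary key.
INSERT INTO Path (path) VALUES (?);

-- Steps 2/3 finish the same way: insert the log row using the id
-- (either the one selected in step 1 or LAST_INSERT_ID()) as the FK.
INSERT INTO access_log (date, hostname, username, path_id, count)
VALUES (?, ?, ?, ?, ?);
```

Note this naive version needs two round-trips per row and has a race window between the SELECT and the INSERT; the solution below addresses both.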

Followup question:

Right now I compose the INSERT INTO statement by concatenating one long string. If I split the table, it sounds like I need to do a lookup before each insert, which I assume will slow the process down quite a bit. Is there a way to avoid that?


Solution

  • First, when you replace a string with a surrogate ID number, you're doing data compression, not normalization. None of the normal forms require replacing strings with numbers.

    I thought about making the path field in Path table unique, but mysql doesn't allow that and gives Specified key was too long; max key length is 1000 bytes error.

    Well, it seems like Path.path needs to be unique. You have some options.

    • Make path 1000 bytes instead of 1024.
    • Switch to a dbms that enforces unique constraints on longer values. PostgreSQL would work.
    • Check uniqueness at the application level instead of enforcing uniqueness at the database level. Beware of race conditions. Write code to support a report of duplicate paths. (As in "print a report of duplicate paths".) Eventually, you will find duplicate paths.
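As a sketch of the application-level option, here is the check-then-create logic with an in-memory dict standing in for the Path table. All names here are hypothetical, and a real database adds the race condition noted above, since another process can insert the same path between your check and your insert:

```python
# In-memory stand-in for the Path table: path string -> surrogate id.
path_ids = {}
next_id = 1

def get_or_create_path_id(path):
    """Return the id for `path`, creating a new entry if needed.

    Against a real database, the check and the insert are two separate
    round-trips, and nothing stops a concurrent writer from inserting
    the same path in between -- that is the race to beware of.
    """
    global next_id
    if path in path_ids:           # step 1: does the path exist?
        return path_ids[path]      # step 2: reuse the existing id
    path_ids[path] = next_id       # step 3: otherwise create it
    next_id += 1
    return path_ids[path]
```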

    The most common approach is to call a stored procedure to insert the row. The SP doesn't check first--it just inserts the row and traps the error raised by a duplicate key. You have to trap errors anyway--lots of things can go wrong besides a duplicate key.

    If the insert succeeds, the SP returns the new ID number. If it fails with a duplicate key error, it selects the ID number for the existing path and returns that instead. What to do if it fails with some other error is application-dependent.
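A sketch of such a procedure in MySQL, trapping error 1062 (MySQL's duplicate-key error); the procedure and table names are assumptions:

```sql
DELIMITER //
CREATE PROCEDURE get_or_create_path (IN  p_path VARCHAR(1000),
                                     OUT p_id   INT UNSIGNED)
BEGIN
    -- On a duplicate key, look up the existing row's id and exit.
    DECLARE EXIT HANDLER FOR 1062
        SELECT id INTO p_id FROM Path WHERE path = p_path;

    -- Otherwise the insert succeeds and we return the new id.
    INSERT INTO Path (path) VALUES (p_path);
    SET p_id = LAST_INSERT_ID();
END //
DELIMITER ;
```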

When you check first and then insert, you need two round-trips to the database for every insert. When you insert first and trap the error, you only need one round-trip for new rows. That reduces the load a little, or a lot, depending on what you're doing.
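In MySQL specifically, the insert-or-look-up can even be collapsed into a single statement using the documented LAST_INSERT_ID(expr) form of ON DUPLICATE KEY UPDATE (a sketch, assuming Path(id, path) with a unique key on path):

```sql
-- New path: a row is inserted and LAST_INSERT_ID() is its fresh id.
-- Existing path: the UPDATE clause re-seeds LAST_INSERT_ID() with the
-- existing row's id. Either way, one statement yields the id to use as FK.
INSERT INTO Path (path) VALUES (?)
ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);

SELECT LAST_INSERT_ID();
```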