I need some advice on how to normalize my current database table.
Currently the schema looks like:
+----------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------------+------+-----+---------+-------+
| date | date | YES | MUL | NULL | |
| hostname | varchar(255) | YES | | NULL | |
| username | varchar(255) | YES | MUL | NULL | |
| path | varchar(1024) | YES | MUL | NULL | |
| count | int(11) | YES | | NULL | |
+----------+---------------+------+-----+---------+-------+
path
is a long string and it often repeats many times with the same value. I try to move it away to a different table and treat it as foreign key to current table. This would save a lot of disk space as currently I have 30M records per week.
The problem is that as I insert the data in another table called Path
, I somehow need to know if the path exists or not. If so, get the existing entry and assign it as a FK to the new entry to current table, otherwise create a new path entry in Path
table. I thought about making the path
field in Path
table unique, but mysql doesn't allow that and gives Specified key was too long; max key length is 1000 bytes
error.
My question would be: what's the best method to do this:
Followup question:
Right now I'm doing a long string concatenation to compose an INSERT INTO
sql statement to do the data insert. If I split the table, sounds like I need to do a lookup before the insert happens. I would assume that will slow down the process quite a bit. Is there a way to avoid that?
First, when you replace a string with a surrogate ID number, you're doing data compression, not normalization. None of the normal forms require replacing strings with numbers.
I thought about making the path field in Path table unique, but mysql doesn't allow that and gives
Specified key was too long; max key length is 1000 bytes
error.
Well, it seems like Path.path needs to be unique. You have some options.
The most common approach is to call a stored procedure to insert the row. The SP doesn't check first--it just inserts the row and traps the error raised by a duplicate key. You have to trap errors anyway--lots of things can go wrong besides a duplicate key.
If the insert succeeds, the SP returns the new ID number. If it fails with a duplicate key error, it selects the ID number for the existing path and returns that instead. What to do if it fails with some other error is application-dependent.
When you check first, then insert, you need two round-trips to the database for every insert. When you insert first and trap the error, you only need one round-tip for new rows. That reduces the load a little. Or a lot, depending on what you're doing.