I have multiple tables that each store 100 million+ rows of data. There are only a few possible unique values for any given column, so many of the columns contain duplicate values.
When I initially designed the schema I decided to use secondary linked tables to store the actual values, in order to optimise the storage space required for the database.
For example:
Instead of a table for storing user agents like this:
I am using 2 tables like this:
Table 1
Table 2
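A minimal sketch of the two layouts being compared. All table and column names, column sizes and the character set are hypothetical and chosen only for illustration:

```sql
-- Hypothetical single-table layout: the full string is repeated in every row.
CREATE TABLE requests_flat (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_agent VARCHAR(255)    NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- Hypothetical "Table 1": each distinct user agent string is stored once.
CREATE TABLE agent_lookup (
    id         SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_agent VARCHAR(255)      NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_agent_lookup_user_agent (user_agent)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- Hypothetical "Table 2": the big table keeps only a small integer reference.
CREATE TABLE requests_normalised (
    id            BIGINT UNSIGNED   NOT NULL AUTO_INCREMENT,
    user_agent_id SMALLINT UNSIGNED NOT NULL,
    PRIMARY KEY (id),
    KEY idx_requests_normalised_agent (user_agent_id),
    CONSTRAINT fk_requests_normalised_agent
        FOREIGN KEY (user_agent_id) REFERENCES agent_lookup (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
```

With only a few hundred distinct user agents, a 2-byte `SMALLINT UNSIGNED` reference is enough for the large table.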
With 100 million+ rows, I found that this schema saves a massive amount of storage space, because there are only a few hundred possible user agents and those strings make up the majority of the data.
The issue I am running into is that using linked tables to store so much of the string data across many different tables adds overhead on the development side and makes querying the data much slower, since joins are required.
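To illustrate the kind of join this implies, here is a rough sketch. It assumes a hypothetical `requests_normalised` table whose `user_agent_id` column points at `agent_lookup.id`; none of these names come from the real schema, and the string literal is a placeholder:

```sql
-- Every read that needs the actual string has to go through the lookup table.
SELECT r.id, a.user_agent
FROM requests_normalised AS r
JOIN agent_lookup AS a
    ON a.id = r.user_agent_id
WHERE a.user_agent = 'Mozilla/5.0 (example)';
```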
My question is: is there a way I can put all of the columns in a single table and force MySQL not to duplicate the storage required for columns with duplicate values? I'm beginning to think there must be some built-in way to handle this type of situation, but I have not found anything in my research.
If I have 10 unique values for a column and 100 million+ rows, why would MySQL store every value in full, duplicates included, rather than just a reference to the unique values?
Thank you!
After some digging and testing I found what seems to be the best solution: creating an index and foreign key constraint using the varchar column itself, rather than using an ID field.
InnoDB supports foreign keys on varchar columns as well as int columns: https://dev.mysql.com/doc/refman/5.6/en/create-table-foreign-keys.html
Here is an example:
user_agents table:
user_requests table:
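A minimal sketch of what these two tables could look like. The table names and the `user_agent` column come from the description here; the `id` column, the column sizes, the character set, the choice of primary key and the cascade rules are assumptions:

```sql
-- Parent table: one row per distinct user agent string.
CREATE TABLE user_agents (
    user_agent VARCHAR(255) NOT NULL,
    PRIMARY KEY (user_agent)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- Child table: the foreign key is the varchar column itself, not an ID.
-- Both columns must share the same type, length, character set and collation.
CREATE TABLE user_requests (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_agent VARCHAR(255)    NOT NULL,
    PRIMARY KEY (id),
    KEY idx_user_requests_user_agent (user_agent),
    CONSTRAINT fk_user_requests_user_agent
        FOREIGN KEY (user_agent) REFERENCES user_agents (user_agent)
        ON UPDATE CASCADE
        ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
```

A row can only be inserted into `user_requests` if its `user_agent` value already exists in `user_agents`.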
I found that when the varchar itself is used as the foreign key, MySQL optimises the storage on its own and only stores one varchar for each unique user_agent on disk. Adding 10 million+ user_requests rows adds very little data on disk.
I also noticed it's even more efficient than using an ID to link the tables, as in the original post. MySQL seems to do some magic under the hood and can link the columns with very little data on disk. In my tests it was at least 100x more storage-efficient than storing all the strings themselves, and several times more efficient than linking by ID. You also get all the benefits of foreign keys and cascading. No joins are required to query the columns in either direction, so the queries are very quick as well!
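As a rough sketch, here is a query without a join, followed by one way to check the on-disk footprint for your own data. It assumes `user_requests` has an `id` column and a `user_agent` column, and the string literal is a placeholder. The sizes reported by `information_schema.TABLES` are estimates for InnoDB, so running `ANALYZE TABLE` first may give fresher numbers:

```sql
-- Read the string straight from the child table; no join is needed.
SELECT id, user_agent
FROM user_requests
WHERE user_agent = 'Mozilla/5.0 (example)';

-- Approximate data and index size per table, in MiB, for the current database.
SELECT table_name,
       ROUND(data_length  / 1024 / 1024, 1) AS data_mib,
       ROUND(index_length / 1024 / 1024, 1) AS index_mib
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name IN ('user_agents', 'user_requests');
```

Comparing these figures before and after a bulk insert, and against the ID-based layout, is a simple way to confirm the savings on a particular MySQL version and row format.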
Cheers!