Search code examples
mysqlquery-optimizationdatabase-performance

What is the most efficient Data type to store categorical string variable in MySQL


I have a table with about 50k rows and multiple columns. Some columns have the data type VARCHAR but the store a unique set of values, Categorical strings.

I'm having some performance issues with this table, so I'm refactoring the data types and did my research and found out SET and ENUM are no better than VARCHAR since there will be a lookup table overhead.

what should I do


Solution

  • I guess by "categorical" you mean those columns have a "controlled vocabulary" – a limited set of possible values.

    Some things you can do to make this table serve you more efficiently. You don't have to do them all. I list them in order of difficulty (difficulty for me at any rate).

    1. Put indexes on the column or columns you will use in WHERE clauses when querying. Doing this is very likely to solve your performance issues: 50k rows is not tiny, but it is small.

      Good index choices are an art. Check out https://use-the-index-luke.com for an introduction. Or, ask another question here if you have performance problems with certain queries.

    2. If possible, and if necessary, declare those columns with COLLATE latin1_bin. That makes them shorter and makes looking them up faster. This won't work if your categorical values are in Arabic or some other language that needs Unicode.

    3. Make a new table. Maybe call it category, and give it an INT UNSIGNED column for category_id and a VARCHAR column for category_name. Then, in your main table use INT UNSIGNED columns rather than VARCHAR columns: treat the new table as a lookup table, and the columns in your main table as numeric references to that table.

      This approach is often used in large (megarow) tables to save RAM and disk space, and to formalize the "controlled vocabulary" of your categories. But I suspect it may be overkill for your app.

    Your conclusions about SETs and ENUMs match my experience. Plus, adding values to ENUMs in a production database can be a shockingly expensive operation.