Tags: mysql, indexing, query-optimization, database-indexes, querying

MySQL query and insertion optimisation with varchar(255) UUIDs


I think this question has been asked in some way, shape or form, but I couldn't find a question that asked exactly what I wish to understand, so I thought I'd put the question here.

Problem statement

I have built a web application with a MySQL database of, say, customer records, with an INT(11) id PK AI field and a VARCHAR(255) uuid field. The uuid field is neither indexed nor set as unique. The uuid field is used as a public identifier, so it's part of URLs etc. - e.g. https://web.com/get_customer/[uuid]. This was done because a UUID is 'harder' to guess for a regular John Doe - but I understand that it is certainly not 'unguessable' in theory. The issue now is that as the database grows larger, I have observed that the query to retrieve a particular customer record is taking longer to complete.
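
For context, here is a simplified sketch of the kind of table I mean (the column names are just placeholders; the real schema has more columns):

CREATE TABLE customer (
  id   INT(11)      NOT NULL AUTO_INCREMENT,
  uuid VARCHAR(255) NOT NULL,  -- public identifier, currently not indexed and not unique
  PRIMARY KEY (id)
) ENGINE=InnoDB;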

My thoughts on how to solve the issue

The solution that comes to mind is to make the uuid field unique and also to index it. But I've been doing some reading on this, and various blog posts and StackOverflow answers describe putting indexes on UUIDs as being really bad for performance. I've also read that it will increase the time it takes to insert a new customer record into the database, because MySQL will take time to find the correct location in the index in which to place the record.

The above-mentioned https://web.com/get_customer/[uuid] can be accessed without having to authenticate, which is why I'm not using the id field for the same purpose. It is possible for me to consider moving to integer-based UUIDs (I don't need the UUIDs to be universally unique - they just need to be unique for that particular table) - will that improve the indexing performance and, in turn, the insertion and querying performance?

Is there a good blog post or information page on how best to set up a database for such a requirement - i.e. the ability to store a customer record with an identifier that is 'hard' to guess, yet easy to insert and easy to query in a large data set?

Any assistance is most appreciated. Thank you!


Solution

  • The received wisdom you mention about putting indexes on UUIDs only comes up when you use them in place of autoincrementing primary keys. Why? The entire table (InnoDB) is built behind the primary key as a clustered index, and bulk loading works best when the index values are sequential.

    You certainly can put an ordinary index on your UUID column. If you want your INSERT operations to fail in the astronomically unlikely event that you get a random duplicate UUID value, you can use an index like this:

    ALTER TABLE customer ADD UNIQUE INDEX uuid_constraint (uuid);
    

    But duplicate UUIDv4s are very rare indeed. They have 122 random bits, and most software generating them these days uses cryptographic-quality random number generators. Omitting the UNIQUE index is, I believe, an acceptable risk. (Don't use UUIDv1, 2, 3, or 5: they're not hard enough to guess to keep your data secure.)
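
    If you want to generate version-4 UUIDs inside MySQL itself, a rough sketch is below; it leans on RANDOM_BYTES(), which draws on the SSL library's random number generator (most applications would simply use their own language's UUID library instead):

    -- Rough sketch: assemble a version-4 UUID from RANDOM_BYTES() (MySQL 5.6.17+)
    SELECT LOWER(CONCAT(
        HEX(RANDOM_BYTES(4)), '-',
        HEX(RANDOM_BYTES(2)), '-',
        '4', SUBSTR(HEX(RANDOM_BYTES(2)), 2), '-',      -- version nibble = 4
        HEX(FLOOR(ASCII(RANDOM_BYTES(1)) / 64) + 8),    -- variant nibble = 8, 9, a, or b
        SUBSTR(HEX(RANDOM_BYTES(2)), 2), '-',
        HEX(RANDOM_BYTES(6))
    )) AS uuid_v4;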

    If your UUID index isn't unique, you save time on INSERTs and UPDATEs: they don't need to look at the index to detect uniqueness constraint violations.

    Edit. When UUID data is in a UNIQUE index, INSERTs are more costly than they are in a similar non-unique index. Should you use a UNIQUE index? Not if you have a high volume of INSERTs. If you have a low volume of INSERTs it's fine to use UNIQUE.

    This is the index to use if you omit UNIQUE:

    ALTER TABLE customer ADD INDEX uuid (uuid);
    

    To make lookups very fast you can use covering indexes. If your most common lookup query is, for example,

    SELECT uuid, givenname, surname, email
      FROM customer
     WHERE uuid = :uuid
    

    you can create this so-called covering index.

    ALTER TABLE customer 
      ADD INDEX uuid_covering (uuid, givenname, surname, email);
    

    Then your query will be satisfied directly from the index and therefore be faster.
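
    To confirm that, you can look at the query plan: with the uuid_covering index in place, EXPLAIN should report that index as the key and show "Using index" in the Extra column (the literal value below is just a placeholder):

    EXPLAIN
    SELECT uuid, givenname, surname, email
      FROM customer
     WHERE uuid = 'placeholder-uuid-value';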

    There's always an extra cost to INSERT and UPDATE operations when you have more indexes. But in a large table, the cost of a full table scan for a query is far, far greater than the extra INSERT or UPDATE cost. That's doubly true if you do a lot of queries.

    In computer science there's often a space / time tradeoff. SQL indexes use space to save time. It's generally considered a good tradeoff.

    (There's all sorts of trickery available to you by using composite primary keys to speed things up. But that's a topic for when you have gigarows.)

    (You can also save index and table space by storing UUIDs in BINARY(16) columns, using the UUID_TO_BIN() and BIN_TO_UUID() functions to convert them.)
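
    A minimal sketch of that approach, assuming a new uuid_bin column (UUID_TO_BIN() and BIN_TO_UUID() require MySQL 8.0):

    -- BINARY(16) keeps each index entry at 16 bytes instead of up to 255 bytes of text
    ALTER TABLE customer
      ADD COLUMN uuid_bin BINARY(16),
      ADD INDEX uuid_bin_index (uuid_bin);

    -- Backfill from the existing VARCHAR column
    UPDATE customer SET uuid_bin = UUID_TO_BIN(uuid);

    -- Convert the text form once in the WHERE clause when looking up
    SELECT BIN_TO_UUID(uuid_bin) AS uuid, givenname, surname, email
      FROM customer
     WHERE uuid_bin = UUID_TO_BIN(:uuid);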