Tags: mysql, indexing, blob, unique

MySQL / MariaDB: How best to ensure BLOB of variable size is Unique


A BLOB in MySQL can be up to 65,535 bytes long. Whilst most of my data will only use a hundred or so bytes of the BLOB per record, it is possible that in some instances the data can be significantly larger.

I've seen, in the answer to MySQL / MariaDB unique BLOB of fixed length, the suggestion to put a bound on the number of significant bytes in the BLOB. That makes sense in the use case presented there, but what is the best way to handle a situation where the size of the data is not known until it needs to be saved, follows no predictable distribution, and could use all the available space (or at least a substantial amount, say 10,000+ bytes)?

Should I define the field using BLOB (65535)(1) to be on the safe side? I'm worried that doing so will require (and waste) a lot of storage space when there are a large number of records in the database.

(1) It appears that MariaDB/MySQL will only accept a length of 3072 in a unique index, so I cannot use 65535.

Table Definition:

CREATE TABLE IF NOT EXISTS `bibtexAlt` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `resource` INT NOT NULL,
  `bibtexAlt` BLOB (3072),
  CONSTRAINT `bibtexAltFK` FOREIGN KEY (`resource`) REFERENCES `resources` (`rID`) ON DELETE CASCADE,
  PRIMARY KEY (`id`),
  CONSTRAINT `bibtexAltNDX` UNIQUE KEY (`resource`, `id`),
  UNIQUE (`bibtexAlt`(3072)) USING HASH
) ENGINE = InnoDB;

Solution

  • Unfortunately, InnoDB has an index key prefix limit of 3072 bytes, so you can't index the whole BLOB even if you wanted to.

    What you can do instead is store a checksum of the blob alongside it and put a unique index on that. To ensure the checksum is always populated we could use triggers, but it's better done as a generated column, as @Shadow suggested.

    create table blobtest (
      stuff blob not null,
      checksum binary(32)
        as (unhex(sha2(stuff, 256))) stored
        not null unique
    );
    

    And voila!

    mysql> insert into blobtest (stuff) values ("Basset hounds got long ears");
    Query OK, 1 row affected (0.00 sec)
    
    mysql> insert into blobtest (stuff) values ("Basset hounds got long ears");
    ERROR 1062 (23000): Duplicate entry '\xCC\x17\xED\x7Fp\xB6+\xCE1\xDC\xFA\x8D\x12\xDE8\xBAU"\xEA6af\xF' for key 'checksum'
    
    mysql> update blobtest set stuff = "abc" where stuff != 'abc';
    ERROR 1062 (23000): Duplicate entry '\xBAx\x16\xBF\x8F\x01\xCF\xEAAA@\xDE]\xAE"#\xB0\x03a\xA3\x96\x17' for key 'checksum'
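
    The same unique index also gives you a cheap duplicate check before inserting: compute the candidate's checksum and probe the index rather than comparing whole blobs. A sketch, reusing the row inserted above:

    -- returns 1 because the basset hound row already exists
    select count(*) as already_stored
    from blobtest
    where checksum = unhex(sha2('Basset hounds got long ears', 256));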
    

    And here's the trigger solution.

    First, let's write a little function for the triggers to share.

    create function blob_checksum(data blob)
      returns binary(32) deterministic
      return unhex(sha2(data, 256));
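
    You can sanity-check the function against the inline expression (an illustrative query, not part of the original answer):

    -- should return 1: the function agrees with the inline digest
    select blob_checksum('abc') = unhex(sha2('abc', 256)) as matches;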
    

    Then the update and insert triggers.

    create trigger before_insert_blobtest_checksum
    before insert on blobtest
    for each row
    set new.checksum = blob_checksum(new.stuff);
    
    create trigger before_update_blobtest_checksum
    before update on blobtest
    for each row
    set new.checksum = blob_checksum(new.stuff);
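
    Note that the trigger variant assumes checksum is an ordinary column, not a generated one: a trigger cannot assign to a generated column. A hypothetical companion table definition for this variant:

    create table blobtest (
      stuff blob not null,
      -- plain column, kept populated by the triggers above
      checksum binary(32) not null unique
    );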