Search code examples
mysqlindexingb-treequery-performance

Determine whether field in MySql indexed column is numeric using index semantics


I have a MySql table where I want to get the count of rows where a given VARCHAR column has a numeric value (convertible to number, you know). Right now, I'm doing a simple REGEXP check on this field. Since this table is very large, I'm using a series of indexes to REGEXP as few rows as possible.

But this VARCHAR column is also indexed. Is there a clever hack of the MySql indexing algorithm that I can exploit to scan even fewer rows? :-/ This is an InnoDB table.


Solution

  • You may not like this, as you are probably already trying to avoid it, but rather than trying to do some clever trick, when I have had situations like this, I add an additional column that stores the varchar in an numeric column (updated using a trigger), and query on that.

    But, there is a way I can see to do it (though I have never had a reason to do this in production), which is to exploit the fact that indexing will put the values in order, such that all that begin with a number are sequenced together.

    Assuming a table like this:

    CREATE TABLE `test_1` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `text_or_number` varchar(255),
      PRIMARY KEY (`id`),
      KEY `test_1_idx` (`text_or_number`)
    ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
    

    You can select only values starting with numbers by exploiting the order of utf8 characters - http://en.wikipedia.org/wiki/UTF-8#Examples

    The lowest value before 0 in utf-8 is "/", and the highest after it is ":", so this should extract only values that start with a number:

    select   cast(text_or_number as unsigned)
    from     test_1
    where    text_or_number < ':'
    and      text_or_number > '/'
    and      cast(text_or_number as unsigned) > 0;
    

    That could still contain values that start with a number, but do not end with one, which is why I have added the cast(...) > 0 clause, but I think mysql will be smart enough to run the where clauses in order, so hopefully it will only run the cast on the subset of rows that start with a numeric char.