Search code examples
mysqlindexingquery-optimizationfenwick-treerange-query

MySQL: Forcing query to use indices with local variable in WHERE clause


Context

I have an application that selects a weighted random entry from a table for which prefix summation (of weights) is a crucial part. The simplified table definition looks like this:

CREATE TABLE entries (
    id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    weight DECIMAL(9, 3),
    fenwick DECIMAL(9, 3)
) ENGINE=MEMORY;

where `fenwick` stores the values within the Fenwick tree representation of `weights`.

Let the "range" of each entry spans between its prefix sum and its prefix sum + its weight. The application must generate a random number @r between 0 and SUM(weight) and finds the entry whose range encompasses @r, like this:

Weighted random entry selection

The Fenwick tree, combined with the MEMORY engine and a binary search, should allow me to find the appropriate entry in O(lg^2(n)) time, as opposed to O(n) time with the naive query:

SELECT a.id-1 FROM (SELECT *, (@x:=@x+weight) AS counter FROM entries 
    CROSS JOIN (SELECT @x:=0) a
    HAVING counter>@r LIMIT 1) a;

Research

I have been trying to condense the prefix sum operation into one query (as opposed to several array accesses seen in scripting languages) due to the overhead of multiple queries. In the process, I've realized that the traditional method of summation, which involves accessing elements in descending key order, would only sum the first element. I was suspicious that MySQL runs through tables linearly when variables are present in the WHERE clause. Here's the query:

SELECT
SUM(1) INTO @garbage
FROM entries 
CROSS JOIN (
    SELECT @sum:=0,
        @n:=@entryid
) a
WHERE id=@n AND @n>0 AND (@n:=@n-(@n&(-@n))) AND (@sum:=@sum+entries.fenwick);
/*SELECT @sum*/

where @entryid is the ID of the entry whose prefix sum we are computing. I did create a query that did work (alongside a function lft that returns the leftmost bit of an integer):

SET @n:=lft(@entryid);
SET @sum:=0;
SELECT
    SUM(1) INTO @garbage
    FROM entries
    WHERE id=@n 
      AND @n<=@entryid 
      AND (@n:=@n+lft(@entryid^@n)) 
      AND (@sum:=@sum+entries.fenwick);
/*SELECT @sum*/

but it only confirmed my suspicion of a linear search. So too does the EXPLAIN query:

+------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
| id   | select_type | table   | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
|    1 | SIMPLE      | entries | ALL  | NULL          | NULL | NULL    | NULL | 752544 | Using where |
+------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)

The indexes:

SHOW INDEXES FROM entries;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| entries |          0 | PRIMARY  |            1 | id          | NULL      |       752544 |     NULL | NULL   |      | HASH       |         |               |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

Now, I have seen many a question asking how to eliminate variables in the WHERE clause so that the optimizer can work on the query. However, I can't think of a way this query can do without id=@n. I've contemplated putting the key values of entries I want to sum into a table and using joins, but I believe that I'll get undesirable effects: either a plethora of tables, or a linear search by evaluating against @entryid anyways.

Question

Is there any way to force MySQL to use the indices for this query? I will even try a different DBMS if they offer this functionality.


Solution

  • Disclaimer

    Fenwick trees are new to me, I only discovered them while finding this post. The results presented here are based on my understanding and some research, but I am by no means a fenwick tree expert, I might have missed things.

    Reference materials

    Explanation of how fenwick tree works

    https://stackoverflow.com/a/15444954/1157540 reproduced from https://cs.stackexchange.com/a/10541/38148

    https://cs.stackexchange.com/a/42816/38148

    Usage of fenwick trees

    https://en.wikipedia.org/wiki/Fenwick_tree

    https://en.wikipedia.org/wiki/Prefix_sum

    Problem 1, finding the weight of a given entry

    Given the following table

    CREATE TABLE `entries` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `weight` decimal(9,3) DEFAULT NULL,
      `fenwick` decimal(9,3) NOT NULL DEFAULT '0.000',
      PRIMARY KEY (`id`)
    ) ENGINE=INNODB;
    

    and given data already populated (see the http://sqlfiddle.com/#!9/be1f2/1 provided by concat), how to count for weight for a given entry @entryid ?

    The key concept to understand here, is that the structure of the fenwick index is based on math and bitwise operations on the id values themselves.

    Queries should typically use primary key lookups only (WHERE ID = value).

    Any query using sorting (ORDER BY) or ranges (WHERE (value1 < ID) AND (ID < value2)) misses the point, and does not traverse the tree in the intended order.

    For example, with the key 60:

    SET @entryid := 60;
    

    Let's decompose the value 60 in binary

    mysql> SELECT (@entryid & 0x0080) as b8,
        ->        (@entryid & 0x0040) as b7,
        ->        (@entryid & 0x0020) as b6,
        ->        (@entryid & 0x0010) as b5,
        ->        (@entryid & 0x0008) as b4,
        ->        (@entryid & 0x0004) as b3,
        ->        (@entryid & 0x0002) as b2,
        ->        (@entryid & 0x0001) as b1;
    +------+------+------+------+------+------+------+------+
    | b8   | b7   | b6   | b5   | b4   | b3   | b2   | b1   |
    +------+------+------+------+------+------+------+------+
    |    0 |    0 |   32 |   16 |    8 |    4 |    0 |    0 |
    +------+------+------+------+------+------+------+------+
    1 row in set (0.00 sec)
    

    In other words, keeping only the bits set, we have

    32 + 16 + 8 + 4 = 60
    

    Now, remove the lowest bits set one by one to navigate the tree:

    32 + 16 + 8 + 4 = 60
    32 + 16 + 8 = 56
    32 + 16 = 48
    32
    

    This gives the path (32, 48, 56, 60) to access element 60.

    Note that transforming 60 to (32, 48, 56, 60) only requires bit math on the ID value itself: no access to the table or the database is needed, and this computation can be done in the client issuing the query.

    The fenwick weigth of element 60 is then

    mysql> select sum(fenwick) from entries where id in (32, 48, 56, 60);
    +--------------+
    | sum(fenwick) |
    +--------------+
    |       32.434 |
    +--------------+
    1 row in set (0.00 sec)
    

    Verification

    mysql> select sum(weight) from entries where id <= @entryid;
    +-------------+
    | sum(weight) |
    +-------------+
    |      32.434 |
    +-------------+
    1 row in set (0.00 sec)
    

    Now, let's compare the efficiency of these queries.

    mysql> explain select sum(fenwick) from entries where id in (32, 48, 56, 60);
    +----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    | id | select_type | table   | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra       |
    +----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    |  1 | SIMPLE      | entries | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL |    4 |   100.00 | Using where |
    +----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    

    or, presented differently

    explain format=json select sum(fenwick) from entries where id in (32, 48, 56, 60);
    {
      "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "5.61"
        },
        "table": {
          "table_name": "entries",
          "access_type": "range",
          "possible_keys": [
            "PRIMARY"
          ],
          "key": "PRIMARY",
          "used_key_parts": [
            "id"
          ],
          "key_length": "4",
          "rows_examined_per_scan": 4,
          "rows_produced_per_join": 4,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "4.81",
            "eval_cost": "0.80",
            "prefix_cost": "5.61",
            "data_read_per_join": "64"
          },
          "used_columns": [
            "id",
            "fenwick"
          ],
          "attached_condition": "(`test`.`entries`.`id` in (32,48,56,60))"
        }
      }
    

    So, the optimizer fetched 4 rows by primary key (there are 4 values in the IN clause).

    When not using the fenwick index, we have

    mysql> explain select sum(weight) from entries where id <= @entryid;
    +----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    | id | select_type | table   | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra       |
    +----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    |  1 | SIMPLE      | entries | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL |   60 |   100.00 | Using where |
    +----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    

    Or, presented differently

    explain format=json select sum(weight) from entries where id <= @entryid;
    {
      "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "25.07"
        },
        "table": {
          "table_name": "entries",
          "access_type": "range",
          "possible_keys": [
            "PRIMARY"
          ],
          "key": "PRIMARY",
          "used_key_parts": [
            "id"
          ],
          "key_length": "4",
          "rows_examined_per_scan": 60,
          "rows_produced_per_join": 60,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "13.07",
            "eval_cost": "12.00",
            "prefix_cost": "25.07",
            "data_read_per_join": "960"
          },
          "used_columns": [
            "id",
            "weight"
          ],
          "attached_condition": "(`test`.`entries`.`id` <= (@`entryid`))"
        }
      }
    

    Here the optimizer performed an index scan, reading 60 rows.

    With ID=60, the benefit of fenwick is 4 fetch compared to 60.

    Now, consider how this scales, with values up to 64K for example.

    With fenwick, a 16 bits value will have at most 16 bits set, hence the number of elements to lookup will be 16 at most.

    Without fenwick, a scan can read up to 64K entries (and will read 32K in average).

    Problem 2, finding the entry for a given weight

    The OP problem was to find an entry for a given weight.

    For example

    SET @search_weight := 35.123;
    

    To illustrate the algorithm, this post details how lookups are done (sorry if this is too verbose)

    SET @found_id := 0;
    

    First, find how many entries there is.

    SET @max_id := (select id from entries order by id desc limit 1);
    

    In the test data, max_id is 156.

    Because 128 <= max_id < 256, the highest bit to start the search is 128.

    mysql> set @search_id := @found_id + 128;
    mysql> select id, fenwick, @search_weight,
        ->    if (fenwick <= @search_weight, "keep", "discard") as action
        ->    from entries where id = @search_id;
    +-----+---------+----------------+---------+
    | id  | fenwick | @search_weight | action  |
    +-----+---------+----------------+---------+
    | 128 |  66.540 |         35.123 | discard |
    +-----+---------+----------------+---------+
    

    Weight 66.540 is greater than our search, so 128 is discarded, move on to the next bit.

    mysql> set @search_id := @found_id + 64;
    mysql> select id, fenwick, @search_weight,
        ->    if (fenwick <= @search_weight, "keep", "discard") as action
        ->    from entries where id = @search_id;
    +----+---------+----------------+--------+
    | id | fenwick | @search_weight | action |
    +----+---------+----------------+--------+
    | 64 |  33.950 |         35.123 | keep   |
    +----+---------+----------------+--------+
    

    Here we need to keep this bit (64), and count the weight found:

    set @found_id := @search_id, @search_weight := @search_weight - 33.950;
    

    Then continue to the next bits:

    mysql> set @search_id := @found_id + 32;
    mysql> select id, fenwick, @search_weight,
        ->    if (fenwick <= @search_weight, "keep", "discard") as action
        ->    from entries where id = @search_id;
    +----+---------+----------------+---------+
    | id | fenwick | @search_weight | action  |
    +----+---------+----------------+---------+
    | 96 |  16.260 |          1.173 | discard |
    +----+---------+----------------+---------+
    
    mysql> set @search_id := @found_id + 16;
    mysql> select id, fenwick, @search_weight,
        ->    if (fenwick <= @search_weight, "keep", "discard") as action
        ->    from entries where id = @search_id;
    +----+---------+----------------+---------+
    | id | fenwick | @search_weight | action  |
    +----+---------+----------------+---------+
    | 80 |   7.394 |          1.173 | discard |
    +----+---------+----------------+---------+
    
    mysql> set @search_id := @found_id + 8;
    mysql> select id, fenwick, @search_weight,
        ->    if (fenwick <= @search_weight, "keep", "discard") as action
        ->    from entries where id = @search_id;
    +----+---------+----------------+---------+
    | id | fenwick | @search_weight | action  |
    +----+---------+----------------+---------+
    | 72 |   3.995 |          1.173 | discard |
    +----+---------+----------------+---------+
    
    mysql> set @search_id := @found_id + 4;
    mysql> select id, fenwick, @search_weight,
        ->    if (fenwick <= @search_weight, "keep", "discard") as action
        ->    from entries where id = @search_id;
    +----+---------+----------------+---------+
    | id | fenwick | @search_weight | action  |
    +----+---------+----------------+---------+
    | 68 |   1.915 |          1.173 | discard |
    +----+---------+----------------+---------+
    
    mysql> set @search_id := @found_id + 2;
    mysql> select id, fenwick, @search_weight,
        ->    if (fenwick <= @search_weight, "keep", "discard") as action
        ->    from entries where id = @search_id;
    +----+---------+----------------+--------+
    | id | fenwick | @search_weight | action |
    +----+---------+----------------+--------+
    | 66 |   1.146 |          1.173 | keep   |
    +----+---------+----------------+--------+
    

    We found another bit here

    set @found_id := @search_id, @search_weight := @search_weight - 1.146;
    
    mysql> set @search_id := @found_id + 1;
    mysql> select id, fenwick, @search_weight,
        ->    if (fenwick <= @search_weight, "keep", "discard") as action
        ->    from entries where id = @search_id;
    +----+---------+----------------+--------+
    | id | fenwick | @search_weight | action |
    +----+---------+----------------+--------+
    | 67 |   0.010 |          0.027 | keep   |
    +----+---------+----------------+--------+
    

    And one more

    set @found_id := @search_id, @search_weight := @search_weight - 0.010;
    

    The final search result is:

    mysql> select @found_id, @search_weight;
    +-----------+----------------+
    | @found_id | @search_weight |
    +-----------+----------------+
    |        67 |          0.017 |
    +-----------+----------------+
    

    Verification

    mysql> select sum(weight) from entries where id <= 67;        
    +-------------+                                               
    | sum(weight) |                                               
    +-------------+                                               
    |      35.106 |                                               
    +-------------+                                               
    
    mysql> select sum(weight) from entries where id <= 68;
    +-------------+
    | sum(weight) |
    +-------------+
    |      35.865 |
    +-------------+
    

    And indeed,

    35.106 (fenwick[67]) <= 35.123 (search) <= 35.865 (fenwick[68])
    

    The search look up values to resolve 1 bit at a time, and each lookup result decides the value of the next ID to search.

    The queries given here are for illustration. In a real application, the code should just be a loop that contains:

    SELECT fenwick from entries where id = ?;
    

    with the application code (or a stored procedure) implementing the logic related to @found_id, @search_id and @search_weight.

    General comments

    • Fenwick trees are used for prefix computations. It only makes sense to use these trees if the problem to solve involves prefixes in the first place. Wikipedia has some pointers for applications. The OP did not describes what the fenwick tree was used for, unfortunately.
    • Fenwick trees are a tradeoff between lookup complexity and update complexity, so they only make sense for data with is not static. For static data, computing a full prefix once will be more efficient.
    • The tests performed used an INNODB table, for which the primary key index is sorted, so computing max_id is a simple O(1) operation. If max_id is already available elsewhere, I think even using a MEMORY table with a HASH index on ID will work, as only primary key lookups are used.

    P.S.

    sqlfiddle is down today, so posting the raw data used (originally provided by concat) so that people interrested can re run the tests.

    INSERT INTO `entries` VALUES (1,0.480,0.480),(2,0.542,1.022),(3,0.269,0.269),(4,0.721,2.012),(5,0.798,0.798),(6,0.825,1.623),(7,0.731,0.731),(8,0.181,4.547),(9,0.711,0.711),(10,0.013,0.724),(11,0.930,0.930),(12,0.613,2.267),(13,0.276,0.276),(14,0.539,0.815),(15,0.867,0.867),(16,0.718,9.214),(17,0.991,0.991),(18,0.801,1.792),(19,0.033,0.033),(20,0.759,2.584),(21,0.698,0.698),(22,0.212,0.910),(23,0.965,0.965),(24,0.189,4.648),(25,0.049,0.049),(26,0.678,0.727),(27,0.245,0.245),(28,0.190,1.162),(29,0.214,0.214),(30,0.502,0.716),(31,0.868,0.868),(32,0.834,17.442),(33,0.566,0.566),(34,0.327,0.893),(35,0.939,0.939),(36,0.713,2.545),(37,0.747,0.747),(38,0.595,1.342),(39,0.733,0.733),(40,0.884,5.504),(41,0.218,0.218),(42,0.437,0.655),(43,0.532,0.532),(44,0.350,1.537),(45,0.154,0.154),(46,0.721,0.875),(47,0.140,0.140),(48,0.538,8.594),(49,0.271,0.271),(50,0.739,1.010),(51,0.884,0.884),(52,0.203,2.097),(53,0.361,0.361),(54,0.197,0.558),(55,0.903,0.903),(56,0.923,4.481),(57,0.906,0.906),(58,0.761,1.667),(59,0.089,0.089),(60,0.161,1.917),(61,0.537,0.537),(62,0.201,0.738),(63,0.397,0.397),(64,0.381,33.950),(65,0.715,0.715),(66,0.431,1.146),(67,0.010,0.010),(68,0.759,1.915),(69,0.763,0.763),(70,0.537,1.300),(71,0.399,0.399),(72,0.381,3.995),(73,0.709,0.709),(74,0.401,1.110),(75,0.880,0.880),(76,0.198,2.188),(77,0.348,0.348),(78,0.148,0.496),(79,0.693,0.693),(80,0.022,7.394),(81,0.031,0.031),(82,0.089,0.120),(83,0.353,0.353),(84,0.498,0.971),(85,0.428,0.428),(86,0.650,1.078),(87,0.963,0.963),(88,0.866,3.878),(89,0.442,0.442),(90,0.610,1.052),(91,0.725,0.725),(92,0.797,2.574),(93,0.808,0.808),(94,0.648,1.456),(95,0.817,0.817),(96,0.141,16.260),(97,0.256,0.256),(98,0.855,1.111),(99,0.508,0.508),(100,0.976,2.595),(101,0.353,0.353),(102,0.840,1.193),(103,0.139,0.139),(104,0.178,4.105),(105,0.469,0.469),(106,0.814,1.283),(107,0.664,0.664),(108,0.876,2.823),(109,0.390,0.390),(110,0.323,0.713),(111,0.442,0.442),(112,0.241,8.324),(113,0.881,0.881),(114,0.681,1.562),(115,0.760,0.760),(116,0.760,3.082),(117,0.518,0.518),(118,0.313,0.831),(119,0.008,0.008),(120,0.103,4.024),(121,0.488,0.488),(122,0.135,0.623),(123,0.207,0.207),(124,0.633,1.463),(125,0.542,0.542),(126,0.812,1.354),(127,0.433,0.433),(128,0.732,66.540),(129,0.358,0.358),(130,0.594,0.952),(131,0.897,0.897),(132,0.701,2.550),(133,0.815,0.815),(134,0.973,1.788),(135,0.419,0.419),(136,0.175,4.932),(137,0.620,0.620),(138,0.573,1.193),(139,0.004,0.004),(140,0.304,1.501),(141,0.508,0.508),(142,0.629,1.137),(143,0.618,0.618),(144,0.206,8.394),(145,0.175,0.175),(146,0.255,0.430),(147,0.750,0.750),(148,0.987,2.167),(149,0.683,0.683),(150,0.453,1.136),(151,0.219,0.219),(152,0.734,4.256),(153,0.016,0.016),(154,0.874,0.891),(155,0.325,0.325),(156,0.002,1.217);
    

    P.S. 2

    Now with a full sqlfiddle:

    http://sqlfiddle.com/#!9/d2c82/1