Search code examples
mysqlquery-optimizationinnodb

How can I optimize this range-based query?


I have a MySQL database with four columns: library_name, a string, function_name, a string, low_num, a number, and high_num, a number. low_num and high_num together represent a range, and all the ranges belonging to the same library (having the same library_name) are mutually exclusive. I need to run support a query where, given a ~14,000 str-num pairs, for each pair it returns the row where low_num <= num < high_num and library_name = str. I've made my primary key (library_name, low_num), and my current fastest query is (select * from table where library_name = $name and $num between low_num and high_num limit 1) union (select ...) .... I.e., each pair gets its own query, and then they all get unioned together. However, it's still quite slow (takes around 20 seconds). I also run into memory issues when I do 1 query like this for the 14,000 pairs, so I'm having to break it up into ~14 queries to find 1,000 pairs each (but even one of these 1,000 pair queries on its own takes like 4 seconds). Any ideas how to speed up this query?

SHOW CREATE TABLE: https://db-fiddle.com/f/bjB1zLez2suhdCzt6itge5/0

EXPLAIN SELECT (for just one of the selects in the union): https://db-fiddle.com/f/hyq7aKN89soLZDPyxyJu5T/0


Solution

  • Okay here's the EXPLAIN:

    id select_type table partitions type possible_keys key key_len ref rows filtered Extra
    #3 UNION sym_large NULL range PRIMARY PRIMARY 1026 NULL 14000 11.11 Using where

    The PRIMARY key of the table is on (library_name, low_nm) which are VARCHAR(255) and INT respectively. The key_len of 1026 indicates it is using both columns of the PRIMARY key: 4 bytes for the INT and 255*4 for the utf8mb4 string plus 2 bytes for the string length.

    Probably filtering on both low_num and high_num is not possible, because both are inequality conditions. Usually, an index only helps to reduce the examined rows for N columns in equality conditions, followed by at most one column in an inequality condition. Your conditions are like one equality, followed by two inequality:

    where library_name = $name and low_num <= $num and high_num >= $num
                   (equality)          (inequality)         (inequality)
    

    So only the first two terms of the search can be optimized by the index. The third search term must be applied "the hard way," by testing each row matched by the first two terms.

    There's a possibility that applying the search to high_num instead of low_num would be better for reducing the number of examined rows, but that depends on the data. That is, if on average there are a smaller number of rows between $num and high_num than between low_num and $num, it could be more efficient.

    If that's true, then using a different index on (library_name, high_num) instead may help. But it still would be limited to searching on two out of the three conditions. There's no way to filter on both inequalities in the same query.

    You could also avoid the UNION if you load your 14,000 name/number pairs into a temporary table and do a JOIN instead of thousands of subqueries.

    CREATE TABLE tmp_search_pairs (
      library_name VARCHAR(255) NOT NULL,
      num INT UNSIGNED NOT NULL,
      KEY (library_name, num)
    );
    
    ...insert your search data into the temp table...
    
    SELECT s.*
    FROM sym_large AS s
    JOIN tmp_search_pairs AS p
      ON s.library_name = p.library_name 
      AND s.low_num <= p.num 
      AND s.high_num >= p.num;
    

    I think it's personal preference to use that inequality syntax vs. the BETWEEN syntax. They should optimize the same.