Search code examples
mysqlindexingquery-optimizationfilesort

Avoid "filesort" on UNION RESULT


Sub-query 1:

SELECT * from big_table
where category = 'fruits' and name = 'apple'
order by yyyymmdd desc

Explain:

table       |   key           |   extra
big_table   |   name_yyyymmdd |   using where

Looks great!

Sub-query 2:

SELECT * from big_table
where category = 'fruits' and (taste = 'sweet' or wildcard = '*')
order by yyyymmdd desc

Explain:

table       |   key               |   extra
big_table   |   category_yyyymmdd |   using where

Looks great!

Now if I combine those with UNION:

SELECT * from big_table
where category = 'fruits' and name = 'apple'

UNION

SELECT * from big_table
where category = 'fruits' and (taste = 'sweet' or wildcard = '*')

Order by yyyymmdd desc

Explain:

table       |   key      |   extra
big_table   |   name     |   using index condition, using where
big_table   |   category |   using index condition
UNION RESULT|   NULL     |   using temporary; using filesort

Not so good, it uses filesort.

This is a trimmed down version of a more complexed query, here are some facts about the big_table:

  • big_table has 10M + rows
  • There are 5 unique "category"s
  • There are 5 unique "taste"s
  • There are about 10,000 unique "name"s
  • There are about 10,000 unique "yyyymmdd"s
  • I have created single index on each of those fields, plus composite idx such as yyyymmdd_category_taste_name but Mysql is not using it.

Solution

  • SELECT * FROM big_table
        WHERE category = 'fruits'
          AND (  name = 'apple'
              OR taste = 'sweet'
              OR wildcard = '*' )
        ORDER BY yyyymmdd DESC
    

    And have INDEX(catgory) or some index starting with category. However, if more than about 20% of the table is category = 'fruits' will probably decide to ignore the index and simply do a table scan. (Since you say there are only 5 categories, I suspect the optimizer will rightly eschew the index.)

    Or this might be beneficial: INDEX(category, yyyymmdd), in this order.

    The UNION had to do a sort (either in memory on disk, it is not clear) because it was unable to fetch the rows in the desired order.

    A composite index INDEX(yyyymmdd, ...) might be used to avoid the 'filesort', but it won't use any columns after yyyymmdd.

    When constructing a composite index, start with any WHERE columns compared '='. After that you can add one range or group by or order by. More details.

    UNION is often a good choice for avoiding a slow OR, but in this case it would need three indexes

    INDEX(category, name)
    INDEX(category, taste)
    INDEX(category, wildcard)
    

    and adding yyyymmdd would not help unless you add a LIMIT.

    And the query would be:

    ( SELECT * FROM big_table WHERE category = 'fruits' AND name = 'apple' )
    UNION DISTINCT
    ( SELECT * FROM big_table WHERE category = 'fruits' AND taste = 'sweet' )
    UNION DISTINCT
    ( SELECT * FROM big_table WHERE category = 'fruits' AND wildcard = '*' )
    ORDER BY yyyymmdd DESC
    

    Adding a limit would be even messier. First tack yyyymmdd on the end of each of the three composite indexes, then

    ( SELECT ... ORDER BY yyyymmdd DESC LIMIT 10 )
    UNION DISTINCT
    ( SELECT ... ORDER BY yyyymmdd DESC LIMIT 10 )
    UNION DISTINCT
    ( SELECT ... ORDER BY yyyymmdd DESC LIMIT 10 )
    ORDER BY yyyymmdd DESC  LIMIT 10
    

    Adding an OFFSET would be even worse.

    Two other techniques -- "covering" index and "lazy lookup" might help, but I doubt it.

    Yet another technique is to put all the words in the same column and use a FULLTEXT index. But this may be problematical for several reasons.