Search code examples
sqlsqliteindexing

Multi-column index not used when IN and ORDER BY are combined


In SQLite I filter with IN in my WHERE clause and then ORDER BY. The columns in the IN and ORDER BY are both in the index, with the column used in the IN coming first in the index. However a temporary tree is used for ORDER BY, where I expected the index to be used. From The SQLite Query Optimizer Overview I believe the optimizer should use the index for both IN and ORDER BY.

My database table to store metadata for a machine learning training dataset has ~70 million rows. Each row has a column with a random integer (0-9) for splitting into 10% chunks (for training, testing, etc.). Each row also has a column for a pre-shuffled random order using UUID's. The index is a multi-column index on the dataset split number, followed by the random order UUID.

In with multiple values (unexpected index usage):

A simplified example of a desired query:

SELECT "t1"."tic_id"
FROM TessFfiLightcurveMetadata AS "t1"
WHERE "t1"."dataset_split" IN (4, 5)
ORDER BY "t1"."random_order_uuid";

EXPLAIN QUERY PLAN reveals the index is used to get the dataset_split's, but a temporary tree is used to sort on random_order_uuid. I expected the index would be used instead of the temporary tree.

Equals with single value (expected index usage):

If instead I obtain a single dataset_split using =:

SELECT "t1"."tic_id"
FROM TessFfiLightcurveMetadata AS "t1"
WHERE "t1"."dataset_split" = 4
ORDER BY "t1"."random_order_uuid";

then the index is used for both dataset_split selection and random_order_uuid as desired.

In with single value (unexpected index usage):

Using a single dataset_split with an IN:

SELECT "t1"."tic_id"
FROM TessFfiLightcurveMetadata AS "t1"
WHERE "t1"."dataset_split" IN (4)
ORDER BY "t1"."random_order_uuid";

the index is once again only used for the dataset_split selection and random_order_uuid is sorted on a temporary tree.

Equals with multiple values (unexpected index usage):

If multiple = are used on dataset_split separated by OR's:

SELECT "t1"."tic_id"
FROM TessFfiLightcurveMetadata AS "t1"
WHERE "t1"."dataset_split" = 4 OR "t1"."dataset_split" = 5
ORDER BY "t1"."random_order_uuid";

then the index is only used for the dataset_split selection and ordering is done with a temporary tree. The SQLite Query Optimizer Overview states that multiple equalities separated by OR's will be converted to an IN by the optimizer, which suggests IN combined with ORDER BY is the problem.

In with multiple values without order by (expected index usage):

If ORDER BY is omitted:

SELECT "t1"."tic_id"
FROM TessFfiLightcurveMetadata AS "t1"
WHERE "t1"."dataset_split" IN (4, 5);

the index is used for the dataset_split selection.

Why is the index not used for both parts of the query using IN and ORDER BY together? Am I making an incorrect assumption about how the optimization works on the indexes? Have I done something wrong in setting up my indexes/queries?

Checked using SQLite 3.31.1.


Solution

  • An index is like a phone book -- or any other ordered list if that reference is out-of-date.

    If you choose two last names, such as "Smith" and "Lee", then the first names are alphabetized. However, there is no trivial way to combine the first names to get them order by first name -- Zachary Lee is before Abigail Smith. Just appending them gives you two partially sorted lists.

    That is what is happening with the IN. The sorting information is not accurate for an index scan, so most databases just put in a sort for the sorting.

    There is a mechanism called skip-scan (and I think Oracle is the only database that implements it but I will no doubt be corrected if I am wrong) that could allow an index-only optimization under some circumstances. To be honest, though, I don't even know if Oracle would would use it in this case.