Search code examples
sqliteinner-joinself-join

Optimizing Sqlite self joins


Help me optimizing my sqlite query

table:

CREATE TABLE links (
    c   INTEGER NOT NULL,
    position    INTEGER NOT NULL,
    key_id INTEGER  REFERENCES keys(id),
    PRIMARY KEY(c, position, key_id)
) WITHOUT ROWID;

query

select c_1.* from links c_1 
join links c_2 on c_1.key_id = c_2.key_id and c_2.position > c_1.position
join links c_3 on c_1.key_id = c_3.key_id and c_3.position > c_2.position
join links c_4 on c_1.key_id = c_4.key_id and c_4.position > c_3.position
where c_1.c = unicode('A') 
and c_2.c = unicode('p')
and c_3.c = unicode('i')
and c_4.c = unicode('x')

The idea is to implement substring search by indexing each later('c') of a word('key_id'). I am trying to answer a request of: Give me all the words that contain A and have p on a position that is greater than A and i on a position that is greater than p and same with i and x. The above query should match the below words:

  • Apix
  • ToastApizx
  • xpAoopiox

In other words I am trying to optimize the below query:

select * from links where key like '%A%p%i%x%'

The query plan looks like that: enter image description here

Sample results

c|position|key_id
-----------------
65  1   121
65  1   2292
65  1   3919
65  1   3923
65  1   3925
65  1   3933
65  1   3946
65  1   4375
65  1   4375
65  1   4375
65  1   4375

In this example it found three keys. Later I will map it to the word and be able to show what is the prefix that it found.

A the moment there are 240,076 rows in links and it takes 2 sec to execute. How can make it run faster?


Solution

  • Your primary key index is on c, position, key_id, but in your query, your WHERE and ON tests compare c for equality, position for inequality, and key_id for equality. This means that the key_id in the index can't be used.

    From the documentation (Emphasis added):

    Then the index might be used if the initial columns of the index (columns a, b, and so forth) appear in WHERE clause terms. The initial columns of the index must be used with the = or IN or IS operators. The right-most column that is used can employ inequalities. For the right-most column of an index that is used, there can be up to two inequalities that must sandwich the allowed values of the column between two extremes.

    As you discovered, switching the > in the position checks to = causes a huge speed up - using three equality checks means the entire index can be used to look up matching rows.

    If you either recreate your table with a different order of columns in the PK - to c, key_id, position, or add a new index with those three columns in that order, you should see an improvement, because then the entire index can be used to find rows to join, instead of just part of the index, since it follows the constraint that all but the right-most column is using equality tests.

    The query plan I see after that change:

    QUERY PLAN
    |--SEARCH TABLE links AS c_1 USING PRIMARY KEY (c=?)
    |--SEARCH TABLE links AS c_2 USING PRIMARY KEY (c=? AND key_id=? AND position>?)
    |--SEARCH TABLE links AS c_3 USING PRIMARY KEY (c=? AND key_id=? AND position>?)
    `--SEARCH TABLE links AS c_4 USING PRIMARY KEY (c=? AND key_id=? AND position>?)