Tags: sql, database, sqlite, query-optimization

SQLite3 query optimization: join vs. subselect


I am trying to figure out the best way (it probably doesn't matter much in this case) to find the rows of one table based on the presence of a flag and a relational id in rows of another table.

Here are the schemas:

    CREATE TABLE files (
        id INTEGER PRIMARY KEY,
        dirty INTEGER NOT NULL);

    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL);

I am using SQLite3.

The files table will be very large, typically 10K-5M rows. The resume_points table will be small (<10K rows) with only 1-2 distinct scan_file_id values.

So my first thought was:

    select distinct files.* from resume_points inner join files
    on resume_points.scan_file_id=files.id where files.dirty = 1;

A coworker suggested turning the join around:

    select distinct files.* from files inner join resume_points
    on files.id=resume_points.scan_file_id where files.dirty = 1;

Then I thought that since we know the number of distinct scan_file_id values will be so small, perhaps a subselect would be optimal (in this rare instance):

    select * from files where id in (select distinct scan_file_id from resume_points);

The EXPLAIN outputs had the following row counts: 42, 42, and 48 respectively.
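
Note that each row of EXPLAIN output is a single bytecode instruction, so those counts reflect program length more than actual cost. EXPLAIN QUERY PLAN gives a more direct view of the chosen join strategy; for example (a sketch for the sqlite3 shell, not output from the original question):

    explain query plan
    select * from files where id in (select distinct scan_file_id from resume_points);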


Solution

  • TL;DR: The best query and index combination is:

    create index uniqueFiles on resume_points (scan_file_id);
    select * from (select distinct scan_file_id from resume_points) d
    join files on d.scan_file_id = files.id and files.dirty = 1;
    

    Since I typically work with SQL Server, at first I thought that surely the query optimizer would find the optimal execution plan for such a simple query, regardless of which way you write these equivalent SQL statements. So I downloaded SQLite and started playing around. Much to my surprise, there was a huge difference in performance.

    Here's the setup code:

    CREATE TABLE files (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        dirty INTEGER NOT NULL);

    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL);
    
    insert into files (dirty) values (0);
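    -- Each insert below doubles files: 23 doublings grow the table from
    -- 1 row to 2^23 = 8,388,608 rows, with dirty = 1 on roughly half of
    -- them, since random() returns a negative value about half the time.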
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
    
    -- Each statement inserts 5000 rows sharing a single random
    -- scan_file_id (the uncorrelated subquery is evaluated once per
    -- statement), so resume_points ends up with 10000 rows but only
    -- 2 distinct ids, matching the scenario in the question.
    insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
    
    insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
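
    One way to take timings like those reported below is the sqlite3 shell's built-in timer (a sketch; the exact measurement method used for these numbers isn't specified):

    .timer on
    select * from (select distinct scan_file_id from resume_points) d
    join files on d.scan_file_id = files.id and files.dirty = 1;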
    

    I considered three indexes:

    create index dirtyFiles on files (dirty, id);
    create index uniqueFiles on resume_points (scan_file_id);
    create index fileLookup on files (id);
    

    Below are the queries I tried and the execution times on my i5 laptop. The database file size is only about 200MB since it doesn't have any other data.

    select distinct files.* from resume_points inner join files on resume_points.scan_file_id=files.id where files.dirty = 1;
    4.3 - 4.5 ms with and without index

    select distinct files.* from files inner join resume_points on files.id=resume_points.scan_file_id where files.dirty = 1;
    4.4 - 4.7 ms with and without index

    select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
    2.0 - 2.5 ms with uniqueFiles
    2.6 - 2.9 ms without uniqueFiles

    select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;
    2.1 - 2.5 ms with uniqueFiles
    2.6 - 3.0 ms without uniqueFiles

    SELECT f.* FROM resume_points rp INNER JOIN files f on rp.scan_file_id = f.id
    WHERE f.dirty = 1 GROUP BY f.id;
    4500 - 6190 ms with uniqueFiles
    8.8 - 9.5 ms without uniqueFiles
    14000 ms with uniqueFiles and fileLookup

    select * from files where exists (
    select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1;
    8400 ms with uniqueFiles
    7400 ms without uniqueFiles
    

    It looks like SQLite's query optimizer isn't very advanced at all. The best queries first reduce resume_points to a small number of rows (two in the test case; the OP said it would be 1-2) and only then look up each file to see whether it is dirty.

    The dirtyFiles index didn't make much of a difference for any of the queries. I think that may be because of the way the data is arranged in the test tables; it may make a difference in production tables, but not a large one, since there will be only a handful of lookups. uniqueFiles does make a difference, because it can reduce 10000 rows of resume_points to 2 rows without scanning through most of them. fileLookup made some queries slightly faster, but not enough to change the results significantly, and notably it made the GROUP BY query very slow. (In SQLite an INTEGER PRIMARY KEY column is an alias for the rowid and is already indexed, so fileLookup is redundant in any case.) In conclusion: reduce the result set early to make the biggest difference.
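
    One more planner knob worth trying (a suggestion, not something tested above): ANALYZE records table and index statistics in sqlite_stat1, which the query planner consults when ordering joins, and EXPLAIN QUERY PLAN shows which strategy it picked:

    -- Gather statistics for the planner, then inspect the chosen plan.
    analyze;
    explain query plan
    select * from (select distinct scan_file_id from resume_points) d
    join files on d.scan_file_id = files.id and files.dirty = 1;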