Search code examples
sqlitewindow-functionsgaps-and-islands

Q: SQLite query to find islands in data (that is, sequential rows where a value has met a certain threshold)


I have a SQLite database with, let's say, the percentage of dead cells found in a cancerous organism over time (NOTE: the time column values were changed to simple numbers for readability).

id  time                deadcellspercent
1   000000001000000000  35
2   000000002000000000  54
3   000000003000000000  31
4   000000004000000000  15
5   000000005000000000  38
6   000000006000000000  70
7   000000007000000000  28
8   000000008000000000  13
9   000000009000000000  99
10  000000010000000000  51

I want to create a SQLite query that will return a range of the times where the percentage fell above a certain threshold number. For example, if I say I want the threshold to be >= 20, then the query result should return:

ts_start            ts_end
000000001000000000  000000003000000000
000000005000000000  000000007000000000
000000009000000000  000000010000000000

How do I form the query to do this? I read up on topics such as "SQLite window functions", the "gaps and islands problem", and "analytic functions", but I am an SQL novice and couldn't make heads or tails of it to get the desired result.

Any help would be greatly appreciated.


Solution

  • You were on a right track with window functions and gaps and islands.

    First, let's take your sample data and populate a table with it:

    CREATE TABLE cells(id INTEGER PRIMARY KEY, time TEXT, deadcellspercent INTEGER);
    INSERT INTO cells VALUES(1,'000000001000000000',35);
    INSERT INTO cells VALUES(2,'000000002000000000',54);
    INSERT INTO cells VALUES(3,'000000003000000000',31);
    INSERT INTO cells VALUES(4,'000000004000000000',15);
    INSERT INTO cells VALUES(5,'000000005000000000',38);
    INSERT INTO cells VALUES(6,'000000006000000000',70);
    INSERT INTO cells VALUES(7,'000000007000000000',28);
    INSERT INTO cells VALUES(8,'000000008000000000',13);
    INSERT INTO cells VALUES(9,'000000009000000000',99);
    INSERT INTO cells VALUES(10,'000000010000000000',51);
    

    One possible query (Which uses window functions and thus requires an up to date version of sqlite - 3.25 or newer):

    WITH islands AS (SELECT id, time
                          , row_number() OVER w1 - row_number() OVER w2 AS diff
                          , deadcellspercent >= 20 AS wanted
                     FROM cells
                     WINDOW w1 AS (ORDER BY time)
                          , w2 AS (PARTITION BY deadcellspercent >= 20 ORDER BY time))
    SELECT min(time) AS ts_start, max(time) AS ts_end
    FROM islands
    WHERE wanted = 1
    GROUP BY diff
    ORDER BY diff;
    

    produces:

    ts_start            ts_end            
    ------------------  ------------------
    000000001000000000  000000003000000000
    000000005000000000  000000007000000000
    000000009000000000  000000010000000000
    

    (Heavily influenced by this post on the DBA stackexchange; refer to it for explanations).