SQlite view of durations from table of timestamps

I have a SQlite3 table with the following schema:

CREATE TABLE IF NOT EXISTS automationProcessLog (
    id integer PRIMARY KEY,
    timestamp text NOT NULL,    -- expected format yyyy-MM-dd hh:mm:ss.ssssss+ZZ:ZZ
    month_day text GENERATED ALWAYS AS (substr(timestamp, 6, 2) || '-' || substr(timestamp, 9, 2)) STORED,      -- expected format MM-dd
    hour integer GENERATED ALWAYS AS (substr(timestamp, 12, 2)) STORED,
    minute integer GENERATED ALWAYS AS (substr(timestamp, 15, 2)) STORED,
    weekday text NOT NULL,
    sampleID text,
    carrierID text,
    nodeID text,
    IOnode text,    -- expected True/False
    nodeType string GENERATED ALWAYS AS (substr(nodeID, 1, instr(nodeID, '/')-1)) STORED,
    nodeNumberByType integer GENERATED ALWAYS AS (substr(nodeID, instr(nodeID, '/')+1, length(nodeID))) STORED,
    processStep text,
    data text,
    FOREIGN KEY(nodeID) REFERENCES nodes(nodeID)
)

This table contains all events for samples (sampleID) transported by carriers (carrierID) between nodes (nodeID) on an automation. A carrier can only carry one sample at a time.

I would like to create a view where I have sampleID, carrierID, timestamp when the sample was introduced to the automation, nodeID where the sample was introduced, timestamp when the sample was handled (i.e. anything but loaded), nodeID where the sample was handled, and the duration from load to handled.

A sample is loaded when the row has IOnode = "True" and processStep = "Processed"
A sample can be processed at several nodes, and sometimes even repeatedly at the same node, before arriving at the final destination. In such cases duration is still counted from when the sample was loaded onto the automation.
A sample can get re-introduced to the automation after being unloaded. duration is to be counted from the last loading of the sample.

Some pseudo-logic would be to select sampleID, carrierID, timestamp (timestamp_processed), and nodeID (nodeID_processed) WHERE NOT (IOnode = "True" AND processStep = "Processed"), to this add timestamp (timestamp_loaded) and nodeID (nodeID_loaded) for the row with the same sampleID and carrierID with the latest (greatest) timestamp that is earlier (less) than timestamp_processed WHERE (IOnode = "True" AND processStep = "Processed"), and to this calculate the duration between the timestamps.

I have been looking at the LAG function as well as answers to questions like this one, but I have not been able to produce something that works.

Solution

You can use a subquery or CTE to extract loading times, using the lead() window function to also get the next load time for the same sample.

Then join it with the main table to get the corresponding processing steps and durations.

WITH load_times AS (
    SELECT sampleID, timestamp as timestamp_loaded, nodeID as nodeID_loaded, 
           lead(timestamp,1,'2999-12-31') OVER (PARTITION BY sampleID ORDER BY timestamp) as next_load
    FROM automationProcessLog 
    WHERE IOnode = 'True' and processStep = 'Processed'
)
SELECT sampleID, carrierID, timestamp as timestamp_processed, nodeID as nodeID_processed, 
       timestamp_loaded, nodeID_loaded, 
       (unixepoch(timestamp_processed) - unixepoch(timestamp_loaded)) as duration
FROM automationProcessLog l 
JOIN load_times lt ON (l.sampleID = lt.sampleID AND l.timestamp BETWEEN lt.timestamp_loaded AND lt.next_load)
WHERE NOT (l.IOnode = 'True' and l.processStep = 'Processed')