Search code examples
rpostgresqlgaps-and-islandsrelational-divisiongaps-in-data

Forming groups of spatio-temporally near trajectories in R or PostgreSQL


I'm doing some trajectory analysis using R and PostgreSQL. In order to form groups of trajectory segments where successive positions are spatio-temporally near, I've created the following table. What I'm still missing is the column group_id, which is what my question is about.

bike_id1    datetime             bike_id2    near     group_id
      1    2016-05-28 11:00:00          2    TRUE            1
      1    2016-05-28 11:00:05          2    TRUE            1
      1    2016-05-28 11:00:10          2    FALSE          NA
[...]
      2    2016-05-28 11:00:05          3    TRUE            1
      2    2016-05-28 11:00:10          3    TRUE            1

This is the result of multiple comparisons between each trajectory with every other (all combinations without repetitions) and an inner join on datetime (sampled always on a multiple of 5 seconds). It shows that for certain positions, bike 1 and 2 were sampled at the same time and are spatially near (some arbitrary threshold).

Now I'd like to give away unique ids for the segments where two bikes are spatio-temporally near (group_id). This is where I'm stuck: I'd want the group_id to respect groups with multiple trajectories. The method for assigning the group_id should realize that if bike 1 and 2 are in a group at 2016-05-28 11:00:05, then 3 belongs to the same group if it is near to 2 at that same timestamp (2016-05-28 11:00:05).

Are there tools within R or PostgreSQL that would help me with this task? Running a loop through the table seems like the wrong way to go about this.

EDIT: As @wildplasser pointed out, this seems to be a gaps-and-islands problem which is traditionally solved using SQL. He has kindly produced some sample data that I have slightly extended and will include in the question.

CREATE TABLE nearness
        -- ( seq SERIAL NOT NULL UNIQUE -- surrogate for conveniance
        ( bike1 INTEGER NOT NULL
        , bike2 INTEGER NOT NULL
        , stamp timestamp NOT NULL
        , near boolean
        , PRIMARY KEY(bike1,bike2,stamp)
        );
INSERT INTO nearness( bike1,bike2,stamp,near) VALUES
 (1,2, '2016-05-28 11:00:00', TRUE)
,(1,2, '2016-05-28 11:00:05', TRUE)
,(1,2, '2016-05-28 11:00:10', TRUE)
,(1,2, '2016-05-28 11:00:20', TRUE) -- <<-- gap here
,(1,2, '2016-05-28 11:00:25', TRUE)
,(1,2, '2016-05-28 11:00:30', FALSE)
,(4,5, '2016-05-28 11:00:00', FALSE)
,(4,5, '2016-05-28 11:00:05', FALSE)
,(4,5, '2016-05-28 11:00:10', TRUE)
,(4,5, '2016-05-28 11:00:15', TRUE)
,(4,5, '2016-05-28 11:00:20', TRUE)
,(2,3, '2016-05-28 11:00:05', TRUE) -- <<-- bike 1, 2, 3 are in one grp @ 11:00:05
,(2,3, '2016-05-28 11:00:10', TRUE) -- <<-- no group here
,(6,7, '2016-05-28 11:00:00', FALSE)
,(6,7, '2016-05-28 11:00:05', FALSE)
        ;

Solution

  • UPDATE: [after understanding the real question ;-] Finding the equivalence groups of bikes (set, bike_set) is in fact a relational-division problem. Finding the begin and end of the segments (clust) within a set of bikes is basically the same as in the first attempt.

    • The clusters are stored in arrays: (I trust on the clusters not becoming too large)
    • The array is built by a recursive query: every pair of bikes that has one member in common with the current cluster is merged into it.
    • At the end, the arrays contain all bike_id's that happened to be within reach at the particular time.
    • (plus some intermediate rows which need to be suppressed later by the uniq CTE)
    • The rest is more-or-less standard gap-detection in time-series.

    NOTE: the code trusts on (bike2 > bike1). This is needed to keep the array sorted and thus canonical. The actual content is not guaranteed to be canonical because the order of addition in the recursive query cannot be guaranteed. This may need some extra work.


    CREATE TABLE nearness
            ( bike1 INTEGER NOT NULL
            , bike2 INTEGER NOT NULL
            , stamp timestamp NOT NULL
            , near boolean
            , PRIMARY KEY(bike1,bike2,stamp)
            );
    INSERT INTO nearness( bike1,bike2,stamp,near) VALUES
     (1,2, '2016-05-28 11:00:00', TRUE)
    ,(1,2, '2016-05-28 11:00:05', TRUE)
    ,(1,2, '2016-05-28 11:00:10', TRUE)
    ,(1,2, '2016-05-28 11:00:20', TRUE) -- <<-- gap here
    ,(1,2, '2016-05-28 11:00:25', TRUE)
    ,(1,2, '2016-05-28 11:00:30', FALSE)    -- <<-- these False-records serve no pupose
    ,(4,5, '2016-05-28 11:00:00', FALSE)    -- <<-- result would be the same without them
    ,(4,5, '2016-05-28 11:00:05', FALSE)
    ,(4,5, '2016-05-28 11:00:10', TRUE)
    ,(4,5, '2016-05-28 11:00:15', TRUE)
    ,(4,5, '2016-05-28 11:00:20', TRUE)
    ,(2,3, '2016-05-28 11:00:05', TRUE) -- <<-- bike 1, 2, 3 are in one grp @ 11:00:05
    ,(2,3, '2016-05-28 11:00:10', TRUE) -- <<-- no group here
    ,(6,7, '2016-05-28 11:00:00', FALSE)
    ,(6,7, '2016-05-28 11:00:05', FALSE)
            ;
    
    
            -- Recursive union-find to glue together sets of bike_ids
            -- ,occuring at the same moment.
            -- Sets are represented as {ordered,unique} arrays here
    WITH RECURSIVE wood AS (
            WITH omg AS (
                    SELECT bike1 ,bike2,stamp
                    , row_number() OVER(ORDER BY bike1,bike2,stamp) AS seq
                    , ARRAY[bike1,bike2]::integer[] AS arr
                    FROM nearness n WHERE near = True
                    )
            -- Find all existing combinations of bikes
            SELECT o1.stamp, o1.seq
                    , ARRAY[o1.bike1,o1.bike2]::integer[] AS arr
            FROM omg o1
            UNION ALL
            SELECT o2.stamp, o2.seq -- avoid duplicates inside the array
                    , CASE when o2.bike1 = ANY(w.arr) THEN w.arr || o2.bike2
                    ELSE  w.arr || o2.bike1 END AS arr
            FROM omg o2
            JOIN wood w
                    ON o2.stamp = w.stamp AND o2.seq > w.seq
                    AND (o2.bike1 = ANY(w.arr) OR o2.bike2 = ANY(w.arr))
                    AND NOT (o2.bike1 = ANY(w.arr) AND o2.bike2 = ANY(w.arr))
            )
    , uniq  AS (    -- suppress partial sets caused by the recursive union-find buildup
            SELECT * FROM wood w
            WHERE NOT EXISTS (SELECT * FROM wood nx
                    WHERE nx.stamp = w.stamp
                    AND nx.arr @> w.arr AND nx.arr <> w.arr -- contains but not equal 
                    )
            )
    , xsets AS (    -- make unique sets of bikes
            SELECT DISTINCT arr
            -- , MIN(seq) AS grp
            FROM uniq
            GROUP BY arr
            )
    , sets AS (     -- enumerate the sets of bikes
            SELECT arr
            , row_number() OVER () AS setnum
            FROM xsets
            )
    , drag AS (             -- Detect beginning and end of segments of consecutive observations
            SELECT u.*      -- within a constant set of bike_ids
            -- Edge-detection begin of group
            , NOT EXISTS (SELECT * FROM uniq nx
                    WHERE nx.arr = u.arr
                    AND nx.stamp < u.stamp
                    AND nx.stamp >= u.stamp - '5 sec'::interval
                    ) AS is_first
            -- Edge-detection end of group
            , NOT EXISTS (SELECT * FROM uniq nx
                    WHERE nx.arr = u.arr
                    AND nx.stamp > u.stamp
                    AND nx.stamp <= u.stamp + '5 sec'::interval
                    ) AS is_last
            , row_number() OVER(ORDER BY arr,stamp) AS nseq
            FROM uniq u
            )
    , top AS ( -- id and groupnum for the start of a group
            SELECT nseq
            , row_number() OVER () AS clust
            FROM drag
            WHERE is_first
            )
    , bot AS ( -- id and groupnum for the end of a group
            SELECT nseq
            , row_number() OVER () AS clust
            FROM drag
            WHERE is_last
            )
    SELECT w.seq as orgseq  -- results, please ...
            , w.stamp
            , g0.clust AS clust
            , row_number() OVER(www) AS rn
            , s.setnum, s.arr AS bike_set
            FROM drag w
            JOIN sets s ON s.arr = w.arr
            JOIN top g0 ON g0.nseq <= w.seq
            JOIN bot g1 ON g1.nseq >= w.seq AND g1.clust = g0.clust
            WINDOW www AS (PARTITION BY g1.clust ORDER BY w.stamp)
            ORDER BY g1.clust, w.stamp
            ;
    

    Result:


     orgseq |        stamp        | clust | rn | setnum | bike_set 
    --------+---------------------+-------+----+--------+----------
          1 | 2016-05-28 11:00:00 |     1 |  1 |      1 | {1,2}
          4 | 2016-05-28 11:00:20 |     3 |  1 |      1 | {1,2}
          5 | 2016-05-28 11:00:25 |     3 |  2 |      1 | {1,2}
          6 | 2016-05-28 11:00:05 |     4 |  1 |      3 | {1,2,3}
          7 | 2016-05-28 11:00:10 |     4 |  2 |      3 | {1,2,3}
          8 | 2016-05-28 11:00:10 |     4 |  3 |      2 | {4,5}
    (6 rows)