I'm doing some trajectory analysis using R and PostgreSQL. In order to form groups of trajectory segments where successive positions are spatio-temporally near, I've created the following table. What I'm still missing is the column group_id
, which is what my question is about.
bike_id1 datetime bike_id2 near group_id
1 2016-05-28 11:00:00 2 TRUE 1
1 2016-05-28 11:00:05 2 TRUE 1
1 2016-05-28 11:00:10 2 FALSE NA
[...]
2 2016-05-28 11:00:05 3 TRUE 1
2 2016-05-28 11:00:10 3 TRUE 1
This is the result of multiple comparisons between each trajectory with every other (all combinations without repetitions) and an inner join on datetime
(sampled always on a multiple of 5 seconds). It shows that for certain positions, bike 1 and 2 were sampled at the same time and are spatially near (some arbitrary threshold).
Now I'd like to give away unique ids for the segments where two bikes are spatio-temporally near (group_id
). This is where I'm stuck: I'd want the group_id
to respect groups with multiple trajectories. The method for assigning the group_id
should realize that if bike 1 and 2 are in a group at 2016-05-28 11:00:05
, then 3 belongs to the same group if it is near to 2 at that same timestamp (2016-05-28 11:00:05
).
Are there tools within R or PostgreSQL that would help me with this task? Running a loop through the table seems like the wrong way to go about this.
EDIT: As @wildplasser pointed out, this seems to be a gaps-and-islands problem which is traditionally solved using SQL. He has kindly produced some sample data that I have slightly extended and will include in the question.
CREATE TABLE nearness
-- ( seq SERIAL NOT NULL UNIQUE -- surrogate for conveniance
( bike1 INTEGER NOT NULL
, bike2 INTEGER NOT NULL
, stamp timestamp NOT NULL
, near boolean
, PRIMARY KEY(bike1,bike2,stamp)
);
INSERT INTO nearness( bike1,bike2,stamp,near) VALUES
(1,2, '2016-05-28 11:00:00', TRUE)
,(1,2, '2016-05-28 11:00:05', TRUE)
,(1,2, '2016-05-28 11:00:10', TRUE)
,(1,2, '2016-05-28 11:00:20', TRUE) -- <<-- gap here
,(1,2, '2016-05-28 11:00:25', TRUE)
,(1,2, '2016-05-28 11:00:30', FALSE)
,(4,5, '2016-05-28 11:00:00', FALSE)
,(4,5, '2016-05-28 11:00:05', FALSE)
,(4,5, '2016-05-28 11:00:10', TRUE)
,(4,5, '2016-05-28 11:00:15', TRUE)
,(4,5, '2016-05-28 11:00:20', TRUE)
,(2,3, '2016-05-28 11:00:05', TRUE) -- <<-- bike 1, 2, 3 are in one grp @ 11:00:05
,(2,3, '2016-05-28 11:00:10', TRUE) -- <<-- no group here
,(6,7, '2016-05-28 11:00:00', FALSE)
,(6,7, '2016-05-28 11:00:05', FALSE)
;
UPDATE: [after understanding the real question ;-] Finding the equivalence groups of bikes (set, bike_set) is in fact a relational-division problem. Finding the begin and end of the segments (clust) within a set of bikes is basically the same as in the first attempt.
uniq
CTE)NOTE: the code trusts on (bike2 > bike1)
. This is needed to keep the array sorted and thus canonical. The actual content is not guaranteed to be canonical because the order of addition in the recursive query cannot be guaranteed. This may need some extra work.
CREATE TABLE nearness
( bike1 INTEGER NOT NULL
, bike2 INTEGER NOT NULL
, stamp timestamp NOT NULL
, near boolean
, PRIMARY KEY(bike1,bike2,stamp)
);
INSERT INTO nearness( bike1,bike2,stamp,near) VALUES
(1,2, '2016-05-28 11:00:00', TRUE)
,(1,2, '2016-05-28 11:00:05', TRUE)
,(1,2, '2016-05-28 11:00:10', TRUE)
,(1,2, '2016-05-28 11:00:20', TRUE) -- <<-- gap here
,(1,2, '2016-05-28 11:00:25', TRUE)
,(1,2, '2016-05-28 11:00:30', FALSE) -- <<-- these False-records serve no pupose
,(4,5, '2016-05-28 11:00:00', FALSE) -- <<-- result would be the same without them
,(4,5, '2016-05-28 11:00:05', FALSE)
,(4,5, '2016-05-28 11:00:10', TRUE)
,(4,5, '2016-05-28 11:00:15', TRUE)
,(4,5, '2016-05-28 11:00:20', TRUE)
,(2,3, '2016-05-28 11:00:05', TRUE) -- <<-- bike 1, 2, 3 are in one grp @ 11:00:05
,(2,3, '2016-05-28 11:00:10', TRUE) -- <<-- no group here
,(6,7, '2016-05-28 11:00:00', FALSE)
,(6,7, '2016-05-28 11:00:05', FALSE)
;
-- Recursive union-find to glue together sets of bike_ids
-- ,occuring at the same moment.
-- Sets are represented as {ordered,unique} arrays here
WITH RECURSIVE wood AS (
WITH omg AS (
SELECT bike1 ,bike2,stamp
, row_number() OVER(ORDER BY bike1,bike2,stamp) AS seq
, ARRAY[bike1,bike2]::integer[] AS arr
FROM nearness n WHERE near = True
)
-- Find all existing combinations of bikes
SELECT o1.stamp, o1.seq
, ARRAY[o1.bike1,o1.bike2]::integer[] AS arr
FROM omg o1
UNION ALL
SELECT o2.stamp, o2.seq -- avoid duplicates inside the array
, CASE when o2.bike1 = ANY(w.arr) THEN w.arr || o2.bike2
ELSE w.arr || o2.bike1 END AS arr
FROM omg o2
JOIN wood w
ON o2.stamp = w.stamp AND o2.seq > w.seq
AND (o2.bike1 = ANY(w.arr) OR o2.bike2 = ANY(w.arr))
AND NOT (o2.bike1 = ANY(w.arr) AND o2.bike2 = ANY(w.arr))
)
, uniq AS ( -- suppress partial sets caused by the recursive union-find buildup
SELECT * FROM wood w
WHERE NOT EXISTS (SELECT * FROM wood nx
WHERE nx.stamp = w.stamp
AND nx.arr @> w.arr AND nx.arr <> w.arr -- contains but not equal
)
)
, xsets AS ( -- make unique sets of bikes
SELECT DISTINCT arr
-- , MIN(seq) AS grp
FROM uniq
GROUP BY arr
)
, sets AS ( -- enumerate the sets of bikes
SELECT arr
, row_number() OVER () AS setnum
FROM xsets
)
, drag AS ( -- Detect beginning and end of segments of consecutive observations
SELECT u.* -- within a constant set of bike_ids
-- Edge-detection begin of group
, NOT EXISTS (SELECT * FROM uniq nx
WHERE nx.arr = u.arr
AND nx.stamp < u.stamp
AND nx.stamp >= u.stamp - '5 sec'::interval
) AS is_first
-- Edge-detection end of group
, NOT EXISTS (SELECT * FROM uniq nx
WHERE nx.arr = u.arr
AND nx.stamp > u.stamp
AND nx.stamp <= u.stamp + '5 sec'::interval
) AS is_last
, row_number() OVER(ORDER BY arr,stamp) AS nseq
FROM uniq u
)
, top AS ( -- id and groupnum for the start of a group
SELECT nseq
, row_number() OVER () AS clust
FROM drag
WHERE is_first
)
, bot AS ( -- id and groupnum for the end of a group
SELECT nseq
, row_number() OVER () AS clust
FROM drag
WHERE is_last
)
SELECT w.seq as orgseq -- results, please ...
, w.stamp
, g0.clust AS clust
, row_number() OVER(www) AS rn
, s.setnum, s.arr AS bike_set
FROM drag w
JOIN sets s ON s.arr = w.arr
JOIN top g0 ON g0.nseq <= w.seq
JOIN bot g1 ON g1.nseq >= w.seq AND g1.clust = g0.clust
WINDOW www AS (PARTITION BY g1.clust ORDER BY w.stamp)
ORDER BY g1.clust, w.stamp
;
Result:
orgseq | stamp | clust | rn | setnum | bike_set
--------+---------------------+-------+----+--------+----------
1 | 2016-05-28 11:00:00 | 1 | 1 | 1 | {1,2}
4 | 2016-05-28 11:00:20 | 3 | 1 | 1 | {1,2}
5 | 2016-05-28 11:00:25 | 3 | 2 | 1 | {1,2}
6 | 2016-05-28 11:00:05 | 4 | 1 | 3 | {1,2,3}
7 | 2016-05-28 11:00:10 | 4 | 2 | 3 | {1,2,3}
8 | 2016-05-28 11:00:10 | 4 | 3 | 2 | {4,5}
(6 rows)