I have an interesting grouping case for which I really can't figure the best way to proceed with. I'm dealing with a lot of data so I'm trying to find the most effective way to perform my query.
Here's an example of data I could have
|Line |Trip |Speed |Timestamp
====================================================
|100 |1 |50 kmh |2017-06-12 22:34:50
|100 |1 |55 kmh |2017-06-12 22:36:44
|100 |1 |56 kmh |2017-06-12 22:37:12
|200 |5 |12 kmh |2017-06-12 22:40:11
|200 |5 |18 kmh |2017-06-12 22:43:13
|100 |1 |23 kmh |2017-06-12 22:49:11
|100 |1 |45 kmh |2017-06-12 22:53:49
I would like to be able to define a grouping number sequentially based on the Line & Trip ordered by the timestamp. So basically at the end I would expect something such as:
|Line |Trip |Speed |Timestamp |GroupNumber
===========================================================================
|100 |1 |50 kmh |2017-06-12 22:34:50 | 1
|100 |1 |55 kmh |2017-06-12 22:36:44 | 1
|100 |1 |56 kmh |2017-06-12 22:37:12 | 1
|200 |5 |12 kmh |2017-06-12 22:40:11 | 2
|200 |5 |18 kmh |2017-06-12 22:43:13 | 2
|100 |1 |23 kmh |2017-06-12 22:49:11 | 3
|100 |1 |45 kmh |2017-06-12 22:53:49 | 3
I can't find any way using DENSE_RANK or ROW_NUMBER because the last group of 100/1 is merged again with the first records but should not because there is another group (200/5) between both occurence.
Any help would be appreciated. Thank you.
This is known as a Gaps and Islands problem.
To identify your islands, you need to use two ranking functions, the first to find the position of a row within the set, and the second to find the position of the row within its subset (Line, Trip), So you would end up with:
SELECT Line,
Trip,
Speed,
Timestamp,
R1 = ROW_NUMBER() OVER(ORDER BY Timestamp),
R2 = ROW_NUMBER() OVER(PARTITION BY Line, Trip ORDER BY Timestamp)
FROM dbo.YourTable;
|Line |Trip |Speed |Timestamp R1 R2
==========================================================================
|100 |1 |50 kmh |2017-06-12 22:34:50 1 1
|100 |1 |55 kmh |2017-06-12 22:36:44 2 2
|100 |1 |56 kmh |2017-06-12 22:37:12 3 3
|200 |5 |12 kmh |2017-06-12 22:40:11 4 1 <-- starts new sequence for new group
|200 |5 |18 kmh |2017-06-12 22:43:13 5 2
|100 |1 |23 kmh |2017-06-12 22:49:11 6 4 <-- follows on from where it left off
|100 |1 |45 kmh |2017-06-12 22:53:49 7 5
Now, if you deduct one from the other you get a unique identifier for each island
|Line |Trip |Speed |Timestamp R1 R2 (R1 - R2)
=======================================================================================
|100 |1 |50 kmh |2017-06-12 22:34:50 1 1 0
|100 |1 |55 kmh |2017-06-12 22:36:44 2 2 0
|100 |1 |56 kmh |2017-06-12 22:37:12 3 3 0
|200 |5 |12 kmh |2017-06-12 22:40:11 4 1 3
|200 |5 |18 kmh |2017-06-12 22:43:13 5 2 3
|100 |1 |23 kmh |2017-06-12 22:49:11 6 4 2
|100 |1 |45 kmh |2017-06-12 22:53:49 7 5 2
Finally, you can use this unique identifier to get a starting time for each group:
SELECT Line,
Trip,
Speed,
Timestamp,
GroupStart = MIN(Timestamp) OVER(PARTITION BY Line, Trip, IslandID)
FROM ( SELECT Line,
Trip,
Speed,
Timestamp,
IslandID = ROW_NUMBER() OVER(ORDER BY Timestamp) -
ROW_NUMBER() OVER(PARTITION BY Line, Trip
ORDER BY Timestamp)
FROM dbo.YourTable
) AS t;
Then finally, you can apply DENSE_RANK()
to the group start time to get an integer ranking. So with your sample data you would get:
-- SAMPLE DATA
DECLARE @T TABLE (Line INT, Trip INT, Speed VARCHAR(6), Timestamp DATETIME);
INSERT @T (Line, Trip, Speed, Timestamp)
VALUES
(100, 1, '50 kmh', '2017-06-12 22:34:50'),
(100, 1, '55 kmh', '2017-06-12 22:36:44'),
(100, 1, '56 kmh', '2017-06-12 22:37:12'),
(200, 5, '12 kmh', '2017-06-12 22:40:11'),
(200, 5, '18 kmh', '2017-06-12 22:43:13'),
(100, 1, '23 kmh', '2017-06-12 22:49:11'),
(100, 1, '45 kmh', '2017-06-12 22:53:49');
WITH GroupedData AS
( SELECT Line,
Trip,
Speed,
Timestamp,
GroupStart = MIN(Timestamp) OVER(PARTITION BY Line, Trip, IslandID),
IslandID
FROM ( SELECT Line,
Trip,
Speed,
Timestamp,
IslandID = ROW_NUMBER() OVER(ORDER BY Timestamp) -
ROW_NUMBER() OVER(PARTITION BY Line, Trip
ORDER BY Timestamp)
FROM @T
) AS t
)
SELECT Line,
Trip,
Speed,
Timestamp,
GroupNumber = DENSE_RANK() OVER(ORDER BY GroupStart, IslandID)
FROM GroupedData
ORDER BY Timestamp;
OUTPUT
Line Trip Speed Timestamp GroupNumber
--------------------------------------------------------------
100 1 50 kmh 2017-06-12 22:34:50.000 1
100 1 55 kmh 2017-06-12 22:36:44.000 1
100 1 56 kmh 2017-06-12 22:37:12.000 1
200 5 12 kmh 2017-06-12 22:40:11.000 2
200 5 18 kmh 2017-06-12 22:43:13.000 2
100 1 23 kmh 2017-06-12 22:49:11.000 3
100 1 45 kmh 2017-06-12 22:53:49.000 3