Search code examples
t-sqlgroup-bysql-server-2014

T-SQL Consider timestamp ordering to determine group


I have an interesting grouping case for which I really can't figure the best way to proceed with. I'm dealing with a lot of data so I'm trying to find the most effective way to perform my query.

Here's an example of data I could have

|Line      |Trip          |Speed        |Timestamp
====================================================
|100       |1             |50 kmh       |2017-06-12 22:34:50
|100       |1             |55 kmh       |2017-06-12 22:36:44
|100       |1             |56 kmh       |2017-06-12 22:37:12
|200       |5             |12 kmh       |2017-06-12 22:40:11
|200       |5             |18 kmh       |2017-06-12 22:43:13
|100       |1             |23 kmh       |2017-06-12 22:49:11
|100       |1             |45 kmh       |2017-06-12 22:53:49

I would like to be able to define a grouping number sequentially based on the Line & Trip ordered by the timestamp. So basically at the end I would expect something such as:

|Line      |Trip          |Speed        |Timestamp            |GroupNumber
===========================================================================
|100       |1             |50 kmh       |2017-06-12 22:34:50  |  1
|100       |1             |55 kmh       |2017-06-12 22:36:44  |  1
|100       |1             |56 kmh       |2017-06-12 22:37:12  |  1
|200       |5             |12 kmh       |2017-06-12 22:40:11  |  2
|200       |5             |18 kmh       |2017-06-12 22:43:13  |  2
|100       |1             |23 kmh       |2017-06-12 22:49:11  |  3
|100       |1             |45 kmh       |2017-06-12 22:53:49  |  3

I can't find any way using DENSE_RANK or ROW_NUMBER because the last group of 100/1 is merged again with the first records but should not because there is another group (200/5) between both occurence.

Any help would be appreciated. Thank you.


Solution

  • This is known as a Gaps and Islands problem.

    To identify your islands, you need to use two ranking functions, the first to find the position of a row within the set, and the second to find the position of the row within its subset (Line, Trip), So you would end up with:

    SELECT  Line,
            Trip,
            Speed,
            Timestamp,
            R1 = ROW_NUMBER() OVER(ORDER BY Timestamp),
            R2 = ROW_NUMBER() OVER(PARTITION BY Line, Trip ORDER BY Timestamp)
    FROM    dbo.YourTable;
    
    
    |Line      |Trip          |Speed        |Timestamp              R1      R2
    ==========================================================================
    |100       |1             |50 kmh       |2017-06-12 22:34:50    1       1
    |100       |1             |55 kmh       |2017-06-12 22:36:44    2       2
    |100       |1             |56 kmh       |2017-06-12 22:37:12    3       3
    |200       |5             |12 kmh       |2017-06-12 22:40:11    4       1   <-- starts new sequence for new group
    |200       |5             |18 kmh       |2017-06-12 22:43:13    5       2
    |100       |1             |23 kmh       |2017-06-12 22:49:11    6       4   <-- follows on from where it left off
    |100       |1             |45 kmh       |2017-06-12 22:53:49    7       5
    

    Now, if you deduct one from the other you get a unique identifier for each island

    |Line      |Trip          |Speed        |Timestamp              R1      R2  (R1 - R2)
    =======================================================================================
    |100       |1             |50 kmh       |2017-06-12 22:34:50    1       1       0
    |100       |1             |55 kmh       |2017-06-12 22:36:44    2       2       0
    |100       |1             |56 kmh       |2017-06-12 22:37:12    3       3       0
    |200       |5             |12 kmh       |2017-06-12 22:40:11    4       1       3
    |200       |5             |18 kmh       |2017-06-12 22:43:13    5       2       3
    |100       |1             |23 kmh       |2017-06-12 22:49:11    6       4       2   
    |100       |1             |45 kmh       |2017-06-12 22:53:49    7       5       2
    

    Finally, you can use this unique identifier to get a starting time for each group:

    SELECT  Line,
            Trip,
            Speed,
            Timestamp,
            GroupStart = MIN(Timestamp) OVER(PARTITION BY Line, Trip, IslandID)
    FROM    (   SELECT  Line,
                        Trip,
                        Speed,
                        Timestamp,
                        IslandID = ROW_NUMBER() OVER(ORDER BY Timestamp) -
                                    ROW_NUMBER() OVER(PARTITION BY Line, Trip 
                                                        ORDER BY Timestamp)
                FROM    dbo.YourTable
            ) AS t;
    

    Then finally, you can apply DENSE_RANK() to the group start time to get an integer ranking. So with your sample data you would get:

    -- SAMPLE DATA
    DECLARE @T TABLE (Line INT, Trip INT, Speed VARCHAR(6), Timestamp DATETIME);
    INSERT @T (Line, Trip, Speed, Timestamp)
    VALUES
        (100, 1, '50 kmh', '2017-06-12 22:34:50'),
        (100, 1, '55 kmh', '2017-06-12 22:36:44'),
        (100, 1, '56 kmh', '2017-06-12 22:37:12'),
        (200, 5, '12 kmh', '2017-06-12 22:40:11'),
        (200, 5, '18 kmh', '2017-06-12 22:43:13'),
        (100, 1, '23 kmh', '2017-06-12 22:49:11'),
        (100, 1, '45 kmh', '2017-06-12 22:53:49');
    
    WITH GroupedData AS
    (   SELECT  Line,
                Trip,
                Speed,
                Timestamp,
                GroupStart = MIN(Timestamp) OVER(PARTITION BY Line, Trip, IslandID),
                IslandID
        FROM    (   SELECT  Line,
                            Trip,
                            Speed,
                            Timestamp,
                            IslandID = ROW_NUMBER() OVER(ORDER BY Timestamp) -
                                        ROW_NUMBER() OVER(PARTITION BY Line, Trip 
                                                            ORDER BY Timestamp)
                    FROM    @T
                ) AS t
    ) 
    SELECT  Line,
            Trip,
            Speed,
            Timestamp,
            GroupNumber = DENSE_RANK() OVER(ORDER BY GroupStart, IslandID)
    FROM    GroupedData
    ORDER BY Timestamp;
    

    OUTPUT

    Line    Trip    Speed   Timestamp                   GroupNumber
    --------------------------------------------------------------
    100     1       50 kmh  2017-06-12 22:34:50.000     1
    100     1       55 kmh  2017-06-12 22:36:44.000     1
    100     1       56 kmh  2017-06-12 22:37:12.000     1
    200     5       12 kmh  2017-06-12 22:40:11.000     2
    200     5       18 kmh  2017-06-12 22:43:13.000     2
    100     1       23 kmh  2017-06-12 22:49:11.000     3
    100     1       45 kmh  2017-06-12 22:53:49.000     3