Search code examples
sqlsql-serverdatetimewindow-functionsrecursive-query

Partitioning rows into groups by accumulative time interval


I got a search sessions log that looks like this:

+----------+-------------------------+----------+
|    dt    |       search_time       | searches |
+----------+-------------------------+----------+
| 20200601 | 2020-06-01 00:36:38.000 |        1 |
| 20200601 | 2020-06-01 00:37:38.000 |        1 |
| 20200601 | 2020-06-01 00:39:18.000 |        1 |
| 20200601 | 2020-06-01 01:16:18.000 |        1 |
| 20200601 | 2020-06-01 03:56:38.000 |        1 |
| 20200601 | 2020-06-01 05:36:38.000 |        1 |
| 20200601 | 2020-06-01 05:37:38.000 |        1 |
| 20200601 | 2020-06-01 05:39:38.000 |        1 |
| 20200601 | 2020-06-01 05:41:38.000 |        1 |
| 20200601 | 2020-06-01 07:26:38.000 |        1 |
+----------+-------------------------+----------+

My task is to partition each row into session groups. Session groups are up to five minutes.

For example:

Those TOP 3 sessions will form a group session 1 - if we accumulate the minutes between each row, we will get 3 minutes and the 4th would accumulate to more then 5 minutes so it will be a different session group.

+----------+-------------------------+----------+---------------+
|    dt    |       search_time       | searches | group_session |
+----------+-------------------------+----------+---------------+
| 20200601 | 2020-06-01 00:36:38.000 |        1 |             1 |
| 20200601 | 2020-06-01 00:37:38.000 |        1 |             1 |
| 20200601 | 2020-06-01 00:39:18.000 |        1 |             1 |
| 20200601 | 2020-06-01 01:16:18.000 |        1 |             2 |
+----------+-------------------------+----------+---------------+

I manipulated the table like this in order to get it ready for partitioning:

WITH [Sub Table] AS
(

SELECT   [dt]
        ,[search_time]
        ,[pervious search time] = LAG(search_time) OVER (ORDER BY search_time)
        ,[min diff] = ISNULL(DATEDIFF(MINUTE,LAG(search_time) OVER (ORDER BY search_time),search_time),0)
        ,[searches]
FROM     [search_session]
)
SELECT
    [dt],
    [search_time],
    [pervious search time],
    [min diff],
    [searches]
FROM [Sub Table]

And got this:

+----------+-------------------------+-------------------------+----------+----------+
|    dt    |       search_time       |  pervious search time   | min diff | searches |
+----------+-------------------------+-------------------------+----------+----------+
| 20200601 | 2020-06-01 00:36:38.000 | NULL                    |        0 |        1 |
| 20200601 | 2020-06-01 00:37:38.000 | 2020-06-01 00:36:38.000 |        1 |        1 |
| 20200601 | 2020-06-01 00:39:18.000 | 2020-06-01 00:37:38.000 |        2 |        1 |
| 20200601 | 2020-06-01 01:16:18.000 | 2020-06-01 00:39:18.000 |       37 |        1 |
| 20200601 | 2020-06-01 03:56:38.000 | 2020-06-01 01:16:18.000 |      160 |        1 |
| 20200601 | 2020-06-01 05:36:38.000 | 2020-06-01 03:56:38.000 |      100 |        1 |
| 20200601 | 2020-06-01 05:37:38.000 | 2020-06-01 05:36:38.000 |        1 |        1 |
| 20200601 | 2020-06-01 05:39:38.000 | 2020-06-01 05:37:38.000 |        2 |        1 |
| 20200601 | 2020-06-01 05:41:38.000 | 2020-06-01 05:39:38.000 |        2 |        1 |
| 20200601 | 2020-06-01 07:26:38.000 | 2020-06-01 05:41:38.000 |      105 |        1 |
+----------+-------------------------+-------------------------+----------+----------+

I thought about two possibilities to continue:

  1. Using a window function, like RANK(), I can partition the rows, but I can't figure out how to set the PARTITION BY caluse with a condition to do so.

  2. To iterate the table with a WHILE loop - again finding hard time to form ths


Solution

  • This cannot be done with just window functions. You need some kind of iterative process, that keeps track of the first row of each group, and dynamically identififes the next one.

    In SQL, you can phrase this with a recursive query:

    with 
        data as (select t.*, row_number() over(order by search_time) rn from mytable t),
        cte as (
            select d.*, search_time as first_search_time 
            from data d
            where rn = 1
            union all 
            select d.*, 
                case when d.search_time > dateadd(minute, 5, c.first_search_time)
                    then d.search_time
                    else c.first_search_time
                end
            from cte c 
            inner join data d on d.rn = c.rn + 1
        )
    select c.*, dense_rank() over(order by first_search_time) grp 
    from cte c
    

    For your sample data, this returns:

    dt         | search_time             | searches | rn | first_search_time       | grp
    :--------- | :---------------------- | -------: | -: | :---------------------- | --:
    2020-06-01 | 2020-06-01 00:36:38.000 |        1 |  1 | 2020-06-01 00:36:38.000 |   1
    2020-06-01 | 2020-06-01 00:37:38.000 |        1 |  2 | 2020-06-01 00:36:38.000 |   1
    2020-06-01 | 2020-06-01 00:39:18.000 |        1 |  3 | 2020-06-01 00:36:38.000 |   1
    2020-06-01 | 2020-06-01 01:16:18.000 |        1 |  4 | 2020-06-01 01:16:18.000 |   2
    2020-06-01 | 2020-06-01 03:56:38.000 |        1 |  5 | 2020-06-01 03:56:38.000 |   3
    2020-06-01 | 2020-06-01 05:36:38.000 |        1 |  6 | 2020-06-01 05:36:38.000 |   4
    2020-06-01 | 2020-06-01 05:37:38.000 |        1 |  7 | 2020-06-01 05:36:38.000 |   4
    2020-06-01 | 2020-06-01 05:39:38.000 |        1 |  8 | 2020-06-01 05:36:38.000 |   4
    2020-06-01 | 2020-06-01 05:41:38.000 |        1 |  9 | 2020-06-01 05:36:38.000 |   4
    2020-06-01 | 2020-06-01 07:26:38.000 |        1 | 10 | 2020-06-01 07:26:38.000 |   5