Search code examples
sqlsubquerycaseaggregatewindow-functions

SQL: how do you perform aggregations when you can't use PARTITION the column?


Below is an image of the table:

https://i.sstatic.net/mPUGV.png

I have a table that tracks the users' visit of a mobile app. Each row represents the datetime when a user enters a page in the app. Min_btw_page shows the number of minute between each page visit. When Min_btw_page is >= 30 mins, the session is considered to be complete and the next page visit will be counted as a new session. What I am trying to find are:

  1. The number of pages (namely rows) visited per session by each user (HashID);
  2. The average minutes spent per session

I have used the lag() function to create "Min_btw_next_page". I have also created the column "row_no", attempting to give a sequence no. to each row by HashID by session, but I failed. The result should be like the column "Expected_row_no". However, even if I am able to achieve the right row number, I still do not know how to aggregate the rows by session as I would not be able to partition row number.


Solution

  • My understanding of your question is that you want to distinguish 'sessions' for users. You define a new 'session' as one where the user hadn't done anything for more than 30 minutes. As such, if someone does many actions, with 20 minutes or so between each action, it still counts as a single 'session'.

    One approach (definitely not the only approach) would start with a minor change to what you have. Note also that here is only a partial answer - it's preparation for later analyses.

    Note also that

    • It's written in SQL Server - if you use something else, you'll need to review
    • You get quicker and better solutions if you post the data in machine readable form, so we don't have to retype!
    • I've avoided partitioning (except in the first LAG) - as requested. I assume you used a partition in the LAG to get your values though, so I used one there. It does, however, use the SUM(column) OVER (ORDER BY ...) to get a running total.

    Here, what I'm doing is to create a column where all values in a 'session' get the same value e.g., the first six rows in your table get value 1, the next two rows get value 2, the next eight rows get value 3. From there, you can group by to find averages etc, and also doing other things like numbering become trivial.

    Process involves

    • Instead of finding the next VisitDateTime, find the last visit datetime. This is very important, because it enables us to determine (on the one row) if it's a new session or not via a simple DATEDIFF
    • Each row that is a 'new session' is flagged with the value 1, otherwise 0.
    • Then the sessions numbers are created by simply taking a running total of those flags

    Data setup

    CREATE TABLE #DeviceLoads (LogID int IDENTITY(1,1), HashID nvarchar(10), DeviceDatetime datetime);
    INSERT INTO #DeviceLoads (HashID, DeviceDatetime) VALUES
    ('ID1', '20201013 15:26'),
    ('ID1', '20201013 15:26'),
    ('ID1', '20201013 15:28'),
    ('ID1', '20201013 15:28'),
    ('ID1', '20201013 15:28'),
    ('ID1', '20201014 14:59'),
    ('ID1', '20201014 14:59'),
    ('ID1', '20201014 16:17'),
    ('ID1', '20201014 16:46'),
    ('ID1', '20201014 17:15'),
    ('ID1', '20201014 17:46');
    

    Here is the one command (feel free to split it up though).

    • The CTE DL_Source uses the LAG function (similar to what you have, I believe, to create your original table) to determine the last activity time
    • The CTE DL_Session_Source gets the data from above, and flags the new sessions with value 1
    • The final SELECT creates the running total from DL_Session_Source
    WITH DL_source AS   -- This is probably similar to what you have already
        (SELECT  LogID, HashID, DeviceDatetime, LAG(DeviceDatetime, 1) OVER (PARTITION BY HashId ORDER BY DeviceDatetime, LogID) AS Last_DeviceDateTime
           FROM #DeviceLoads),
    DL_Session_Source AS
        (SELECT  LogID, HashID, DeviceDatetime, Last_DeviceDateTime, CASE WHEN DATEDIFF(minute, Last_DeviceDateTime, DeviceDatetime) <= 30 THEN 0 ELSE 1 END AS New_Session_flag
           FROM DL_source)
    SELECT  *, SUM(New_Session_flag) OVER (ORDER BY HashID, DeviceDatetime, LogID) AS Session_Num
      FROM  DL_Session_Source;
    

    And here are the results (with seconds truncated for brevity). Note the column on the end (Session_Num) which indicates which rows are in which session.

    LogID  HashID  DeviceDatetime    Last_DeviceDateTime   New_Session_flag    Session_Num
    1      ID1     2020-10-13 15:26  NULL                  1                   1
    2      ID1     2020-10-13 15:26  2020-10-13 15:26      0                   1
    3      ID1     2020-10-13 15:28  2020-10-13 15:26      0                   1
    4      ID1     2020-10-13 15:28  2020-10-13 15:28      0                   1
    5      ID1     2020-10-13 15:28  2020-10-13 15:28      0                   1
    6      ID1     2020-10-14 14:59  2020-10-13 15:28      1                   2
    7      ID1     2020-10-14 14:59  2020-10-14 14:59      0                   2
    8      ID1     2020-10-14 16:17  2020-10-14 14:59      1                   3
    9      ID1     2020-10-14 16:46  2020-10-14 16:17      0                   3
    10     ID1     2020-10-14 17:15  2020-10-14 16:46      0                   3
    11     ID1     2020-10-14 17:46  2020-10-14 17:15      1                   4
    

    From here, feel free to save to a temporary table or so for further processing e.g.,

    SELECT Session_Num, 
           HashID, 
           COUNT(*) AS Num_Actions, 
           MIN(DeviceDateTime) AS First_Action,  
           MAX(DeviceDateTime) AS Last_Action
    FROM #YourTempTable
    GROUP BY Session_Num, HashID;
    

    Here's a db<>fiddle with some added 'interweaved' data (e.g., out of order and overlapping for HashID ID2) to help ensure it works as required.