Search code examples
sql-serversplittime-and-attendance

MSSQL: identify Cohorts from Session Attendance data


I have data on group session attendance by topic, in which one row is one user attending a session on a given date/time. A group offering can have a set number of sessions, e.g. 10, and they are usually offered at the same time (StartTime) and day of week (DayOfWeek).

I would like to identify start and end dates of the group offering, that is the first session and the last session for a given cohort. A cohort would be determined from the list of the same users who have attended the group sessions.

Table: ---GroupSessions--- GroupSessionDate UserID Topic StartTime DayOfWeek ArrivalStatus Jan-08-2015 1 A 11:30:00 AM Thursday Arrived Jan-08-2015 2 A 11:30:00 AM Thursday Arrived Jan-08-2015 3 A 11:30:00 AM Thursday Arrived Jan-08-2015 4 A 11:30:00 AM Thursday Arrived Jan-15-2015 1 A 11:30:00 AM Thursday Arrived Jan-15-2015 2 A 11:30:00 AM Thursday Arrived Jan-15-2015 3 A 11:30:00 AM Thursday Arrived Jan-15-2015 4 A 11:30:00 AM Thursday Arrived Jan-22-2015 1 A 11:30:00 AM Thursday Arrived Jan-22-2015 2 A 11:30:00 AM Thursday Arrived Jan-22-2015 3 A 11:30:00 AM Thursday Arrived Jan-22-2015 4 A 11:30:00 AM Thursday Missed May-15-2015 5 A 09:00:00 AM Friday Arrived May-15-2015 2 A 09:00:00 AM Friday Arrived May-15-2015 6 A 09:00:00 AM Friday Arrived May-22-2015 5 A 09:00:00 AM Friday Arrived May-22-2015 6 A 09:00:00 AM Friday Arrived May-22-2015 2 A 09:00:00 AM Friday Missed May-29-2015 5 A 09:00:00 AM Friday Arrived May-29-2015 6 A 09:00:00 AM Friday Arrived May-29-2015 2 A 09:00:00 AM Friday Missed

In the example above, there are 2 cohorts. Cohort 1 would be comprised of users 1,2,3,and 4 and nearly all of whom have attended the group offering (topic A) between Jan-08-2015 and Jan-22-2015. The same users attended Jan-15-2015 session as the Jan-08-2015 session and nearly all also attended Jan-22-2015 session.

The Cohort 2 (also for topic A) is comprised of users 2,5,6, with offering dates May 15 to 29th, 2015.

The Number of sessions is not set per offering, as it can change depending on demand, so I cannot factor in the number of sessions from the offering date.

I have looked at the Oracle/SQL: Split two inter-related lists into independent cohorts but the question remains unanswered.

Normally I would do the inspection visually and assign users to cohorts, but I have tens of thousands of rows, and was hoping there was a more efficient way to do this with SQL. I am running MSSQL 2014.

I have tried using OUTER APPLY with the table on itself but I am not really getting the result I am looking for.

Could you point me in the right direction please?

SQL:

SELECT src.UserID
  ,src.GroupSessionDate
  ,src.StartTime
  ,src.DayofWeek
  ,src.Topic
  ,prevsessdata.GroupSessionDate
  FROM GroupSessions src OUTER APPLY
  (SELECT TOP 1 * GroupSessions prevsd WHERE src.Topic=prevsd.Topic
  AND src.UserID=prevsd.UserID AND src.DayOfWeek=prevsd.DayOfWeek
  AND src.StartTime=prevsd.StartTime 
  AND prevsd.GroupSessionDate<src.GroupSessionDate) prevsessdata

EDIT: Assumptions that can be made:

  • sessions generally happen 1 week apart (exception: holidays)
  • all sessions in the same cohort would have the same topic, be on same day of the week, and same start time
  • users can stop attending but they would be listed in the sessions they missed - (sorry I missed this crucial detail in the original question!)
  • same users will generally be listed for all the sessions of group offering, although users can attend groups on multiple topics, so there may be instances where a user is listed also as being part of another cohort (here user 2 attended offering in Jan and in May). But this would be limited to 1 or 2 users of 10-12 in a cohort.

Desired output: For each Topic, StartTime, and DoW, list Cohort Start/End Dates.

Thanks!


Solution

  • I think you're looking for something like this:

    ; WITH T AS (
        SELECT CONVERT(DATE, REPLACE(GroupSessionDate, '-', ' ')) GroupSessionDate, Topic, StartTime, [DayOfWeek]
            , STUFF((SELECT ', ' + CONVERT(NVARCHAR, UserID)
                FROM GroupSessions
                WHERE GroupSessionDate = G.GroupSessionDate
                AND Topic = G.Topic
                AND StartTime = G.StartTime
                AND [DayOfWeek] = G.[DayOfWeek]
                ORDER BY UserID
                FOR XML PATH ('')), 1, 2, '') [Cohort]
        FROM GroupSessions G
        GROUP BY GroupSessionDate, Topic, StartTime, [DayOfWeek])
    SELECT Cohort, MIN(GroupSessionDate) SessionStartDate, MAX(GroupSessionDate) SessionEndDate, Topic, StartTime, [DayOfWeek]
    FROM T
    GROUP BY Topic, StartTime, [DayOfWeek], Cohort, DATEDIFF(dd, 0, GroupSessionDate) % 7
    ORDER BY MIN(GroupSessionDate)
    

    A simple group by statement. Really, from what it looks like you're asking, all you need to do is group by topic, starttime, dayofweek and groupsessiondate to find each "cohort" then sort through that again to find the minimum / maximum dates for each of these.

    If this isn't producing the right results, there's probably some other criterion you need to use to narrow it down a bit further.