I have data on group session attendance by topic, in which one row is one user attending a session on a given date/time. A group offering can have a set number of sessions, e.g. 10, and they are usually offered at the same time (StartTime) and day of week (DayOfWeek).
I would like to identify start and end dates of the group offering, that is the first session and the last session for a given cohort. A cohort would be determined from the list of the same users who have attended the group sessions.
Table: ---GroupSessions---
GroupSessionDate UserID Topic StartTime DayOfWeek ArrivalStatus
Jan-08-2015 1 A 11:30:00 AM Thursday Arrived
Jan-08-2015 2 A 11:30:00 AM Thursday Arrived
Jan-08-2015 3 A 11:30:00 AM Thursday Arrived
Jan-08-2015 4 A 11:30:00 AM Thursday Arrived
Jan-15-2015 1 A 11:30:00 AM Thursday Arrived
Jan-15-2015 2 A 11:30:00 AM Thursday Arrived
Jan-15-2015 3 A 11:30:00 AM Thursday Arrived
Jan-15-2015 4 A 11:30:00 AM Thursday Arrived
Jan-22-2015 1 A 11:30:00 AM Thursday Arrived
Jan-22-2015 2 A 11:30:00 AM Thursday Arrived
Jan-22-2015 3 A 11:30:00 AM Thursday Arrived
Jan-22-2015 4 A 11:30:00 AM Thursday Missed
May-15-2015 5 A 09:00:00 AM Friday Arrived
May-15-2015 2 A 09:00:00 AM Friday Arrived
May-15-2015 6 A 09:00:00 AM Friday Arrived
May-22-2015 5 A 09:00:00 AM Friday Arrived
May-22-2015 6 A 09:00:00 AM Friday Arrived
May-22-2015 2 A 09:00:00 AM Friday Missed
May-29-2015 5 A 09:00:00 AM Friday Arrived
May-29-2015 6 A 09:00:00 AM Friday Arrived
May-29-2015 2 A 09:00:00 AM Friday Missed
In the example above, there are 2 cohorts. Cohort 1 would be comprised of users 1,2,3,and 4 and nearly all of whom have attended the group offering (topic A) between Jan-08-2015 and Jan-22-2015. The same users attended Jan-15-2015 session as the Jan-08-2015 session and nearly all also attended Jan-22-2015 session.
The Cohort 2 (also for topic A) is comprised of users 2,5,6, with offering dates May 15 to 29th, 2015.
The Number of sessions is not set per offering, as it can change depending on demand, so I cannot factor in the number of sessions from the offering date.
I have looked at the Oracle/SQL: Split two inter-related lists into independent cohorts but the question remains unanswered.
Normally I would do the inspection visually and assign users to cohorts, but I have tens of thousands of rows, and was hoping there was a more efficient way to do this with SQL. I am running MSSQL 2014.
I have tried using OUTER APPLY with the table on itself but I am not really getting the result I am looking for.
Could you point me in the right direction please?
SQL:
SELECT src.UserID
,src.GroupSessionDate
,src.StartTime
,src.DayofWeek
,src.Topic
,prevsessdata.GroupSessionDate
FROM GroupSessions src OUTER APPLY
(SELECT TOP 1 * GroupSessions prevsd WHERE src.Topic=prevsd.Topic
AND src.UserID=prevsd.UserID AND src.DayOfWeek=prevsd.DayOfWeek
AND src.StartTime=prevsd.StartTime
AND prevsd.GroupSessionDate<src.GroupSessionDate) prevsessdata
EDIT: Assumptions that can be made:
Desired output: For each Topic, StartTime, and DoW, list Cohort Start/End Dates.
Thanks!
I think you're looking for something like this:
; WITH T AS (
SELECT CONVERT(DATE, REPLACE(GroupSessionDate, '-', ' ')) GroupSessionDate, Topic, StartTime, [DayOfWeek]
, STUFF((SELECT ', ' + CONVERT(NVARCHAR, UserID)
FROM GroupSessions
WHERE GroupSessionDate = G.GroupSessionDate
AND Topic = G.Topic
AND StartTime = G.StartTime
AND [DayOfWeek] = G.[DayOfWeek]
ORDER BY UserID
FOR XML PATH ('')), 1, 2, '') [Cohort]
FROM GroupSessions G
GROUP BY GroupSessionDate, Topic, StartTime, [DayOfWeek])
SELECT Cohort, MIN(GroupSessionDate) SessionStartDate, MAX(GroupSessionDate) SessionEndDate, Topic, StartTime, [DayOfWeek]
FROM T
GROUP BY Topic, StartTime, [DayOfWeek], Cohort, DATEDIFF(dd, 0, GroupSessionDate) % 7
ORDER BY MIN(GroupSessionDate)
A simple group by statement. Really, from what it looks like you're asking, all you need to do is group by topic, starttime, dayofweek and groupsessiondate to find each "cohort" then sort through that again to find the minimum / maximum dates for each of these.
If this isn't producing the right results, there's probably some other criterion you need to use to narrow it down a bit further.