Search code examples
sql-serversql-server-2005sumoverlapping

SQL Server - cumulative sum on overlapping data - getting date that sum reaches a given value


In our company, our clients perform various activities that we log in different tables - Interview attendance, Course Attendance, and other general activities. I have a database view that unions data from all of these tables giving us the ActivityView that looks like this. As you can see some activities overlap - for example while attending an interview, a client may have been performing a CV update activity.

+----------------------+---------------+---------------------+-------------------+
| activity_client_id   | activity_type | activity_start_date | activity_end_date |
+----------------------+---------------+---------------------+-------------------+
|                  112 | Interview     | 2015-06-01 09:00    | 2015-06-01 11:00  |
|                  112 | CV updating   | 2015-06-01 09:30    | 2015-06-01 11:30  |
|                  112 | Course        | 2015-06-02 09:00    | 2015-06-02 16:00  |
|                  112 | Interview     | 2015-06-03 09:00    | 2015-06-03 10:00  |
+----------------------+---------------+---------------------+-------------------+

Each client has a "Sign Up Date", recorded on the client table, which is when they joined our programme. Here it is for our sample client:

+-----------+---------------------+
| client_id | client_sign_up_date |
+-----------+---------------------+
|       112 | 2015-05-20          |
+-----------+---------------------+

I need to create a report that will show the following columns:

+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+

We need this report in order to see how effective our programme is. An important aim of the programme is that we get every client to complete at least 5 hours of activity as quickly as possible. So this report will tell us how long from sign up does it take each client to achieve this figure.

What makes this even trickier is that when we calculate 5 hours of total activity, we must discount overlapping activities:

In the sample data above the client attended an interview between 09:00 and 11:00.
On the same day they also performed CV updating activity from 09:30 to 11:30. For our calculation, this would give them total activity for the day of 2.5 hours (150 minutes) - we would only count 30 minutes of the CV updating as the Interview overlaps it up to 11:00.

So the report for our sample client would give the following result:

+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+
|       112 | 2015-05-20          | 2015-06-02                                 |
+-----------+---------------------+--------------------------------------------+

So my question is how can I create the report using a select statement ? I can work out how to do this by writing a stored procedure that will loop through the view and write the result to a report table. But I would much prefer to avoid a stored procedure and have a select statement that will give me the report on the fly.

I am using SQL Server 2005.


Solution

  • This is one way to do it:

    ;WITH CTErn AS (
       SELECT activity_client_id, activity_type,
              activity_start_date, activity_end_date,
              ROW_NUMBER() OVER (PARTITION BY activity_client_id 
                                 ORDER BY activity_start_date) AS rn
       FROM activities
    ),   
    CTEdiff AS (
       SELECT c1.activity_client_id, c1.activity_type,
              x.activity_start_date, c1.activity_end_date,
              DATEDIFF(mi, x.activity_start_date, c1.activity_end_date) AS diff,
              ROW_NUMBER() OVER (PARTITION BY c1.activity_client_id 
                                 ORDER BY x.activity_start_date) AS seq
       FROM CTErn AS c1
       LEFT JOIN CTErn AS c2 ON c1.rn = c2.rn + 1
       CROSS APPLY (SELECT CASE 
                              WHEN c1.activity_start_date < c2.activity_end_date
                                 THEN c2.activity_end_date
                              ELSE c1.activity_start_date
                           END) x(activity_start_date)    
    )
    SELECT TOP 1 client_id, client_sign_up_date, activity_start_date, 
                 hoursOfActivicty               
    FROM CTEdiff AS c1
    INNER JOIN clients AS c2 ON c1.activity_client_id = c2.client_id                     
    CROSS APPLY (SELECT SUM(diff) / 60.0
                 FROM CTEdiff AS c3
                 WHERE c3.seq <= c1.seq) x(hoursOfActivicty)
    WHERE hoursOfActivicty >= 5
    ORDER BY seq
    

    Common Table Expressions and ROW_NUMBER() were introduced with SQL Server 2005, so the above query should work for that version.

    Demo here

    The first CTE, i.e. CTErn, produces the following output:

    client_id   activity_type   start_date          end_date          rn
    112         Interview       2015-06-01 09:00    2015-06-01 11:00  1
    112         CV updating     2015-06-01 09:30    2015-06-01 11:30  2
    112         Course          2015-06-02 09:00    2015-06-02 16:00  3
    112         Interview       2015-06-03 09:00    2015-06-03 10:00  4
    

    The second CTE, i.e. CTEdiff, uses the above table expression in order to calculate time difference for each record, taking into consideration any overlapps with the previous record:

    client_id activity_type start_date       end_date         diff  seq
    112       Interview     2015-06-01 09:00 2015-06-01 11:00 120   1
    112       CV updating   2015-06-01 11:00 2015-06-01 11:30 30    2
    112       Course        2015-06-02 09:00 2015-06-02 16:00 420   3
    112       Interview     2015-06-03 09:00 2015-06-03 10:00 60    4
    

    The final query calculates the cumulative sum of time difference and selects the first record that exceeds 5 hours of activity.

    The above query will work for simple interval overlaps, i.e. when just the end date of an activity overlaps the start date of the next activity.