Search code examples
mysqldatecountoverlapping

Count days within dateranges while excluding overlapping days


I'm looking for the number of days within several dateranges. I used the datediff function to sum the days, but now I want to exclude overlapping days. So from the oldest date untill curdate I would like to have the number of days within the dateranges and every day only counted once if it is in an overlapping range.

My table looks like this:

Person_id |      Start_date      | End_date              | Count
 83244       2014-09-01 00:00:00    2014-09-06 00:00:00    5
 83244       2014-09-08 00:00:00    2015-09-07 00:00:00    364
 83244       2015-01-15 00:00:00    2015-02-01 00:00:00    17

If I sum this I would get 382, but the answer I'm looking for is 369. Since the last row is completely overlapping the second row.

Does anyone have a solution?


Solution

  • I have padded out your example with a second Person_id and shortened the column names a bit to make the code a little shorter:

    CREATE TABLE tbl(`pid` int, `sd` datetime, `ed` datetime);
    INSERT INTO tbl (`pid`, `sd`, `ed`)
    VALUES
        (83244, '2014-09-01', '2014-09-06'),
        (83244, '2014-09-08', '2015-09-07'),
        (83243, '2014-08-08', '2015-08-15'),
        (83243, '2014-08-11', '2015-09-03'),
        (83244, '2015-01-15', '2015-02-01');
    

    So, working on the above data you could apply the following query:

    SELECT pid,sd,ed,CASE WHEN @id!=pid THEN @id:=pid+0*(@ed:=Date('1970-1-1')) END id, 
           CASE WHEN sd<@ed THEN CASE WHEN ed>@ed THEN datediff(ed,@ed) ELSE 0 END 
                            ELSE datediff(ed,sd) END days,
           @ed:=CASE WHEN ed>@ed THEN ed ELSE @ed END enddt
    FROM tbl,( select @id:=0 ) const
    ORDER BY pid,sd
    

    Contrary to other RDBMS MySql has a certain "procedural feel" to it when it comes to select statements. You can actually use variables (@id and @ed) in them that will change their state over time (in this context the order by clause at the end is extremely important).

    The basic idea behind this query is: start with a certain pid and list the intervals in the order of increasing start dates (sd). Always remember the maximum value of the end date (ed) in variable @ed. Now, with each new interval, check whether there is an overlap with the previous interval, i. e. check whether the current start date sd is smaller than the previous (maximum) end date (@ed) and calculate the interval days accordingly.

    The first case clause is necessary to reset the variables @id and @ed whenever the current pid changes.

    Subquery const just sets the variable @id in the beginning.

    The query yields the following result:

      pid   sd                  ed                  id     days enddt
    83243   2014-08-08 00:00:00 2015-08-15 00:00:00 83243   372 2015-08-15 00:00:00
    83243   2014-08-11 00:00:00 2015-09-03 00:00:00          19 2015-09-03 00:00:00
    83244   2014-09-01 00:00:00 2014-09-06 00:00:00 83244     5 2014-09-06 00:00:00
    83244   2014-09-08 00:00:00 2015-09-07 00:00:00         364 2015-09-07 00:00:00
    83244   2015-01-15 00:00:00 2015-02-01 00:00:00           0 2015-09-07 00:00:00 
    

    See here for a Demo.

    If you are just interested in the total sums you can of course wrap the whole query in another grouping one like this:

    SELECT pid,sum(days) FROM (
     SELECT pid,sd,ed,CASE WHEN @id!=pid THEN @id:=pid+0*(@ed:=Date('1970-1-1')) END id, 
            CASE WHEN sd<@ed THEN CASE WHEN ed>@ed THEN datediff(ed,@ed) ELSE 0 END 
                             ELSE datediff(ed,sd) END days,
            @ed:=CASE WHEN ed>@ed THEN ed ELSE @ed END enddt
     FROM tbl,( select @id:=0 ) const
     ORDER BY pid,sd
    ) t GROUP BY pid ORDER BY pid
    

    which will then get you

    pid     sum(days)
    83243   391
    83244   369