Search code examples
sas

Search dates between ranges of periods and count occurrences


suppose to have the following:

data have;
input ID :$20. Start :date9. End :date9.;
format start end ddmmyy9.;
cards;
0001 01JAN2015 30JUN2015 
0001 01JUL2015 01FEB2016 
0001 02FEB2016 11DEC2016 
0001 12DEC2016 06FEB2017 
0001 07FEB2017 31DEC2017 
0002 01JAN2016 31DEC2017 
0002 01JAN2018 01MAR2018
0002 01APR2018 31NOV2018
......................
;          

and a list of dates:

data dates;            
input dates :$20.;              
format dates ddmmyy9.;    
cards; 
01JAN2015 
31DEC2015 
01JAN2016 
31DEC2016 
01JAN2017 
31DEC2017 
01JAN2018 
31DEC2018 
;   

Is there a way to know if, for each ID, each date is in the range? For example: the ID 0001 contains all dates except 01JAN2018 and 31DEC2018. Moreover, for each year I need to count how many IDs start at 01/01 and end at 31/12 so they appear for the entire year. For example, ID 0002 will not be counted for 2018 because it ends before 31/12. Desired output:

ID       01JAN2015  31DEC2015  01JAN2016 31DEC2016 01JAN2017  31DEC2017  01JAN2018   31DEC2018
0001        yes        yes         yes       yes       yes      yes         no         no       
0002         no         no         yes       yes       yes      yes         yes        no       

Final table:

 Year       Count      
 2015         1         
 2016         2
 2017         2       
 2018         0              

To match the dates in the range I tried:

   proc sql;
   create table want as;
   select dates as t1;
   join have as t2;       
   t2.dates between t1.start and t1.end
   order by 1,2;        
   quit;      

Unfortunately I lose the ID correspondence.

Can anyone help me please?

Thank you in advance


Solution

  • The first output can be achieved using proc sql followed by proc transpose and then a data step.
    (The variables are created in a different order than in your desired output - hopefully that is not a problem for you.)

    The second can be done with a data step followed by proc summary.

    NB - I changed the invalid date 31NOV2018 to 30NOV2018 before running my code.

    Output 1

    * First, join the HAVE and DATES data sets to check which dates fall within a range;
    proc sql;
        create table want1 as
            select h.ID, d.dates format=date9., 'yes' as flag
                from dates d left join have h on (d.dates between h.start and h.end)
                order by ID, dates      
                ;
    run;
    
    * transpose this to the wider format required. Cells where there is no match will be blank;
    proc transpose data=want1 out=want1 (where=(ID ne '') drop=_name_); 
        by ID;
        id dates;
        var flag;
    run;
    
    * Now populate the blank cells with "no";
    data want1;
        set want1;
        array datevars (*) _all_;
        do i = 1 to dim(datevars);
            if datevars(i) = '' then datevars(i)='no';
        end;
        drop i;
    run;
    

    Output 2

    
    * This assumes the data are in order by ID, start and end - if not then sort the data before this step;
    * First read throught the HAVE data set and set count = 1 for each year that is fully covered by each ID;
    data want2;
        set have;
        by ID;
        retain first last count; * FIRST and LAST are the earliest START and latest END respectively for each ID;
        if first.ID then do;
            first=start;
            count=0;
        end;
        if last.ID then do;
            last=end;
            do yr=year(first) to year(last); * for each ID we want to cycle through all the years covered;
                if first <= mdy(1,1,yr) and last >= mdy(12,31,yr) then count = 1; * sets count=1 if the year is fully contained within the range;
                output; 
                count=0;    
            end;            
        end;
        drop start end first last;
        format first last date9.;
    run;
    
    * The step above produces a row for each combination of year * ID, we just want totals by year which is done by PROC SUMMARY;
    proc summary data=want2 nway;
        class yr;
        var count;
        output out=want2 (drop=_type_ _freq_) sum=;
    run;