Search code examples
sqlloopssaslag

SAS do loop + lag function?


This is my first post, so please let me know if I'm not clear enough. Here's what I'm trying to do - this is my dataset. My approach for this is a do loop with a lag but the result is rubbish.

data a;
input @1 obs @4 mindate mmddyy10. @15 maxdate mmddyy10.;
format mindate maxdate date9.;
datalines;
1   01/02/2013 01/05/2013
2   01/02/2013 01/05/2013
3   01/02/2013 01/05/2013
4   01/03/2013 01/06/2013
5   02/02/2013 02/08/2013
6   02/02/2013 02/08/2013
7   02/02/2013 02/08/2013
8   03/10/2013 03/11/2013
9   04/02/2013 04/22/2013
10  04/10/2013 04/22/2013
11  05/04/2013 05/07/2013
12  06/10/2013 06/20/2013
;
run;

Now, I'm trying to produce a new column - "Replacement" based on the following logic:

  1. If a record's mindate occurs before its lag's maxdate, it cannot be a replacement for it. If it cannot be a replacement, skip forward (so- 2,3,4 cannot replace 1, but 5 can).
  2. Otherwise... if the mindate is less than 30 days, Replacement = Y. If not, replacement = N. Once a record replaces another (so, in this case, 5 does replace 1, because 02/02/2013 is <30 than 01/05/2013, it cannot duplicate as a replacement for another record. But if it's an N for one record above, it can still be a Y for some other record. So, 6 is now evaluated against 2, 7 against 3,etc. Since those two combos are both "Y", 8 is now evaluated versus 4, but because its mindate >30 relative to 4's maxdate, it's a N. But, it's then evaluated against against
  3. And so on...

I should that in a 100 record dataset, this would imply that the 100th record could technically replace the 1st, so I've been trying lags within loops. Any tips/help is greatly appreciated! Expected output:

                      obs      mindate      maxdate    Replacement

                        1    02JAN2013    05JAN2013
                        2    02JAN2013    05JAN2013
                        3    02JAN2013    05JAN2013
                        4    03JAN2013    06JAN2013
                        5    02FEB2013    08FEB2013         Y
                        6    02FEB2013    08FEB2013         Y
                        7    02FEB2013    08FEB2013         Y
                        8    10MAR2013    11MAR2013         Y
                        9    02APR2013    22APR2013         Y
                       10    10APR2013    22APR2013         N
                       11    04MAY2013    07MAY2013         Y
                       12    10JUN2013    20JUN2013         Y

Solution

  • I think this is correct if the asker was mistaken about replacement = Y for obs = 12.

    /*Get number of obs so we can build a temporary array to hold the dataset*/
    data _null_;
        set have nobs= nobs;
        call symput("nobs",nobs);
        stop;
    run;
    
    data want;
        /*Load the dataset into a temporary array*/
        array dates[2,&NOBS] _temporary_;
        if _n_ = 1 then do _n_ = 1 by 1 until(eof);
            set have end = eof;
            dates[1,_n_] = maxdate;
            dates[2,_n_] = 0;
        end;
    
        set have;
    
        length replacement $1;
    
        replacement = 'N';
        do i = 1 to _n_ - 1 until(replacement = 'Y');
            if dates[2,i] = 0 and 0 <= mindate - dates[1,i] <= 30 then do;
                replacement = 'Y';
                dates[2,i] = _n_;
                replaces = i;
            end;
        end;
        drop i; 
    run;
    

    You could use a hash object + hash iterator instead of a temporary array if you preferred. I've also included an extra var, replaces, to show which previous row each row replaces.