Search code examples
sasdelete-rowretain

Dropping observations after a certain condition is met in SAS


This is an extension of an earlier question. (Drop observations once condition is met by multiple variables).

I have the following data and used one of the existing answered questions to solve my data problem but could not get what I want. Here is what I have in my data

  • Amt1 is populated when the Evt_type is Fee
  • Amt2 is populated when the Evt_type is REF1/REF2
  • I don't want to display any observations after the last Flag='Y'
  • If there is no Flag='Y' then I want all the observations for that id (e.g. id=102)
  • I want to display if the next row for that id is a Fee followed by REF1/REF2 after flag='Y' (e.g. id=101) However I don't want if there is no REF1/REF2 (e.g.id=103)

Have:

   id   Date        Evt_Type   Flag   Amt1   Amt2
  101  2/2/2019      Fee              5
  101  2/3/2019      REF1      Y             5
  101  2/4/2019      Fee              10
  101  2/6/2019      REF2      Y             10
  101  2/7/2019      Fee               4
  101  2/8/2019      REF1
  102  2/2/2019      Fee              25
  102  2/2/2019      REF1      N      25
  103  2/3/2019      Fee              10
  103  2/4/2019      REF1      Y             10
  103  2/5/2019      Fee              10

Want:

  id   Date        Evt_Type   Flag   Amt1   Amt2
 101  2/2/2019      Fee              5
 101  2/3/2019      REF1      Y             5
 101  2/4/2019      Fee              10
 101  2/6/2019      REF2      Y             10
 101  2/7/2019      Fee               4
 101  2/8/2019      REF1
 102  2/2/2019      Fee              25
 102  2/2/2019      REF1      N      25
 103  2/3/2019      Fee              10
 103  2/4/2019      REF1      Y             10

I tried the following

data want;
  _max_n_with_Y = 1e12;

  do _n_ = 1 by 1 until (last.id);
    set have;
    by id;
  if flag='Y' then _max_n_with_Y = _n_;
 end;

  do _n_ = 1 to _n_;
   set have;
   if _n_ <= _max_n_with_Y then OUTPUT;
  end;
 drop _:;
run;

Any help is appreciated.

Thanks


Solution

  • The important 'landmark' is the row with flag='Y'

    The extra criteria for outputting rows post-landmark complicate the state machine being coded to track (or compute) the row number (_n_) for last output of the group.

    The row='Y' state is easily known. Unconditional use of LAG can be used to examine the post-Y state. SAS IF statements do not have short circuit evaluation, so as long as the LAG is not in a subordinate THEN clause, the LAG stacks will be appropriate for the task.

    Example:

    data have;
    attrib
      id format=4.
      date informat=mmddyy10. format=mmddyy10.
      evt_type length=$4
      flag length=$1
      amt1 amt2 format=4.
    ;
    input
       id   Date        Evt_Type   Flag   Amt1   Amt2; datalines;
      101  2/2/2019      Fee       .      5      .
      101  2/3/2019      REF1      Y      .      5
      101  2/4/2019      Fee       .      10     .
      101  2/6/2019      REF2      Y      .      10
      101  2/7/2019      Fee       .       4     .
      101  2/8/2019      REF1      .      .      .
      102  2/2/2019      Fee       .      25     .
      102  2/2/2019      REF1      N      25     .
      103  2/3/2019      Fee       .      10     .
      103  2/4/2019      REF1      Y      .      10
      103  2/5/2019      Fee       .      10     .
    ;
    
    data want;
      _y_n = 1e12;
    
      do _n_ = 1 by 1 until (last.id);
        set have;
        by id;
    
        if flag='Y' then _y_n = _n_;
    
        /* rule: post Y output of two rows should only occur once, and at the rows
         * immediately succeeding the Y row
         */
        if _n_ = _y_n + 2            /* is this row 2 after a Y */
          and lag(evt_type) = 'Fee'  /* is first row after Y Fee */
          and evt_type =: 'REF'      /* is second row after Y REF# */
        then 
          _upto_n = _n_;
      end;
    
      _upto_n = max (_upto_n, _y_n);
    
      do _n_ = 1 to _n_;
       set have;
       if _n_ <= _upto_n then OUTPUT;
      end;
    
      drop _:;
    run;
    

    Note, regarding:

    if _n_ = _y_n + 2            /* is this row 2 after a Y */
      and lag(evt_type) = 'Fee'  /* is first row after Y Fee */
      and evt_type =: 'REF'      /* is second row after Y REF# */
    then 
      _upto_n = _n_;
    

    For the row 2 after Y,

      LAG2(<var>) is the <var> value from the Y row
      LAG (<var>) is the <var> value from the Y row+1
           <var>  is the <var> value from the Y row+2, which is the current row