Search code examples
eventssaslaglead

SAS: Marking next and previous observations with conflicts


I want to create dummies in SAS. My plan is, everytime AnnouncementDate ne . I want to make a variable called event=2 and the next one also (event=2) as well as the five previous observations 1 And all this flagged if the Previous 5 observations and the day after the announcement date are not empty. and in a by class (by stock for example)

output data

date announcementdate event conflict
1115 . . .
2115 . . .
3115 . . .
4115 . . .
5115 . . .
6115 . 1 .
7115 . 1 .
8115 . 1 .
9115 . 1 .
10115 . 1 .
11115 1115 2 .
12115 . 2 .
13115 . .
14115 . 1 .
16115 . 1 .
17115 . 1 .
18115 . 1 .
19115 . 1 .
20115 20115 2 
21115 . 1 Y
21115 . 1 Y
22115 22115 2 Y
23115 . 2 Y
24115 . .
25115 . .

If this now switches to the next stock, it should start over and not take the previous observations into account.

My current code does a lead term as I merge just the event column with firstobs=2. Anyways this 5x lag bothers me as well as the conflict when there are observations before.

data want1 ; set have;;
if announcementdate ne . then event=0;
run;

data want2; merge want1 (firstobs=2) want1 (keep=event rename= (event=event2)); 
if event2=0 then event=0; 
if event2=0 then event=event2;
run;

Thankful for any input!


Solution

  • My approach to this is to first parse the dataset to work out what the ranges would be, using _n_ as the indicator (rather than the id). Then you can build a simple start-end range and use that to build a format (well, two, one for 1 and one for 2).

    Then, parse that dataset with a look-ahead merge (to grab the next value of start) and see if a conflict exists; then use your logic to define another format for conflict markers. My example includes logic for this; it may or may not match your exact needs though, so you might have to think a bit more about this logic and modify this some as it's fairly specific.

    Then you just apply the format using _n_ as the input. Make sure the original dataset doesn't get sorted, of course.

    Here's the complete code.

    data have;
    input date announcementdate event_want conflict_want $;
    datalines;
    1115 . . .
    2115 . . .
    3115 . . .
    4115 . . .
    5115 . . .
    6115 . 1 .
    7115 . 1 .
    8115 . 1 .
    9115 . 1 .
    10115 . 1 .
    11115 1115 2 .
    12115 . 2 .
    13115 . . .
    14115 . 1 .
    16115 . 1 .
    17115 . 1 .
    18115 . 1 .
    19115 . 1 .
    20115 20115 2 .
    21115 . 1 Y
    21115 . 1 Y
    22115 22115 2 Y
    23115 . 2 Y
    24115 . . .
    25115 . . .
    ;;;;
    run;
    
    data for_fmt;
      set have nobs=nobs;
      length fmtname $16;
      fmtname = 'eventf';
    
      if _n_=1 then do;   *HLO='o' gives the 'other' value;
        start=.;
        end  =.;
        label = ' ';
        hlo='o';
        output;
      end;
      hlo=' ';
    
      if not missing(announcementdate) then do;
        start = max(1,_n_-5);
        end   = max(1,_n_-1);
        label='1';
        output;
        start = _n_;
        end   = min(_n_+1,nobs);
        label='2';
        output;
      end;
    run;  *now we have a dataset of the ranges - but they overlap;
    
    data for_fmt_conflicts;
    
      merge for_fmt(in=a) for_fmt(in=_in_next firstobs=2 keep=start rename=start=next_start); *look-ahead merge;
      retain new_start conflict_marker;
      if hlo='o' then do;    *again the 'other' row;
        output;
        fmtname='conflictf';
        output;
      end;
      else if end gt next_start and (_in_next) then do;   *if the next row will be a conflict, adjust the end back one and indicate where next should start;
        end = end-1;
        new_start = end+1;
        conflict_marker=1;
        output;
      end;
      else if conflict_marker=1 then do;  *if this is a conflict row type 1;
        start = new_start;
        output;
        fmtname='conflictf';
        label  ='Y';
        output;
        conflict_marker=2;
      end;
      else if conflict_marker=2 then do;  *if this is the 2s for a conflict row;
        output;
        fmtname='conflictf';
        label  ='Y';
        output;
        conflict_marker=0;
      end;
      else output;
    run;
    proc sort data=for_fmt_conflicts;  *must sort to get fmtnames grouped together;
      by fmtname start;
    run;
    proc format cntlin=for_fmt_conflicts;  *import formats;
    quit;
    
    data want;  *now apply formats;
      set have;
      event = put(_n_,eventf.);
      conflict = put(_n_,conflictf.);
    run;