Search code examples
duplicatessas

Flag duplicates/triplicates or higher based on 2 variables in SAS


I have data that looks like this:

ID Year
A 1989
A 1989
A 1989
B 1990
B 1990
C 1991
C 1992

And want to output data with an additional Flag variable such that the first record of a duplicate or higher is 0, the second is 1, etc. based on both ID and Year. Observations that have a unique combination of ID and Year would not be flagged:

ID Year Flag
A 1989 0
A 1989 1
A 1989 2
B 1990 0
B 1990 1
C 1991
C 1992

I am only familiar with proc sort nodupkey as a way to eliminate duplicates, but I want to instead retain all data and add a flagging variable.


Solution

  • Try this

    data have;
    input ID $ Year;
    datalines;
    A 1989
    A 1989
    A 1989
    B 1990
    B 1990
    C 1991
    C 1992
    ;
    
    data want;
       do _N_ = 1 by 1 until (last.Year);
          set have;
          by ID Year;
       end;
    
       do flag = 0 to _N_ - 1;
          set have;
          if _N_ = 1 then flag = .;
          output;
       end;
    run;
    

    Result:

    ID  Year  flag
    A   1989  0
    A   1989  1
    A   1989  2
    B   1990  0
    B   1990  1
    C   1991  .
    C   1992  .