Search code examples
sas

Keep a random row, by group in SAS


I have a dataset with company data and some duplicated companies by their id. These duplications are due differences in variables like city that it is in, or sector, among other variables.

This is a simplified example of my data

data have;
infile datalines delimiter=',';
input id $ city $ sector $;
datalines;
1,A,0015
1,B,0912
1,C,0812
2,D,0712
2,E,0612
2,F,0512
3,G,0412
3,H,0312
3,I,0212
4,J,1012
4,K,5512
4,L,5515
5,M,3042
5,O,2012
5,P,3012
6,Q,6012
6,Q,8012
6,Q,4012
7,R,6012
7,S,2012
7,T,0317
8,U,0016
8,V,0014
8,W,0032
9,Y,0312
9,Z,1012
9,A,0212
;
RUN;

I want to have only one line per company id, but I dont want to transpose the data nor to evaluate the differences on their other variables. Instead, I want random drop duplicated rows by id, keeping only one random row per company id. nodupkey alone doesn't work for me, because it always keep the first row.

I tried to adapt a code I saw in a SAS forum but it didnt work for me:

data temp;
drop seed;
set have;
by id;
retain random;
if first.id then do;
  seed = input(id, 8.);
  random = ranuni(seed) ;
end;
run;

proc sort;
data=temp out=want;
by id random;

proc sort;
data=want nodupkey;
by id;

Solution

  • If you fix the random number generation part it should work fine. There are some misplaced semicolons in the two sort steps you need clean up also.

    data temp;
      set have;
      random = rand('uniform');
    run;
    
    proc sort data=temp out=want;
      by id random;
    run;
    
    proc sort data=want nodupkey;
      by id;
    run;