Get a subset of a data set based on SRS of the unique values of a variable in SAS


I have a data set with two variables x and y. x has four distinct values 1, 2, 3, and 4. I want to first take a simple random sample of size 2 from these 4 unique values and keep the corresponding rows.

For example, if the SRS is 1 and 2, I will keep the first 7 rows as a new data set. If the SRS is 2 and 3, I will keep the fifth through eighth rows. Here is a simple example data set to start with. Thank you.

data dataone;
  input x y;
  datalines;
  1 3
  1 4
  1 5
  1 8
  2 3
  2 7
  2 9
  3 2
  4 8
  4 5
  ;
run;

Solution

  • You can do this in one of two ways.

    PROC SURVEYSELECT will do the work for you if you do it in two steps: first give it a dataset containing just the unique X values, then merge the selected values back to the dataset that has all the rows.
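
    The two-step PROC SURVEYSELECT approach might look roughly like the sketch below. It assumes the question's dataone data set is already sorted by x (as it is in the example); the data set names x_values and x_sample and the seed value are hypothetical choices, not part of the original answer.

    ```sas
    /* Step 1: build a data set with one row per distinct value of x */
    proc sort data=dataone(keep=x) out=x_values nodupkey;
      by x;
    run;

    /* Step 2: draw a simple random sample of 2 of those x values */
    /* (seed=7 is arbitrary; it only makes the draw repeatable)   */
    proc surveyselect data=x_values out=x_sample
                      method=srs sampsize=2 seed=7 noprint;
    run;

    /* Step 3: keep every row of dataone whose x value was sampled */
    data want;
      merge dataone(in=in_data) x_sample(in=in_sample);
      by x;
      if in_data and in_sample;
    run;
    ```

    Because x_sample has at most one row per x, the merge is one-to-many and every matching row of dataone is retained.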

    Alternatively, you can do this in a data step, where you first decide whether to take each X value and then keep or drop its rows accordingly.

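    data want;
      set dataone;  *the data set from the question, already sorted by x;
      by x;
      if _n_ = 1 then call streaminit(7);  *seed the generator once;
      retain need 2; *x values still to be selected;
      retain have 4; *x values not yet examined - this could be determined programmatically;
      retain keep;  *and these three could all be one retain statement, this is just more readable;
      if first.x then do;
        *take this x with probability need/have;
        if need/have ge rand('Uniform') then keep=1;
        else keep=0;
        need + -keep;
        have + -1;
        put x= keep= need= have=;  *log the decision for each x value;
      end;
      if keep;  *output only rows whose x was selected;
    run;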
    

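    This is a modified form of reservoir sampling (the sequential version is often called selection sampling). It gives the same results as an SRS, even though the odds of taking any one X appear to differ. The first X is kept with probability 2/4, and the second is kept with probability (1/2)(1/3) + (1/2)(2/3) = 1/2, and the same holds for the remaining values.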
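
    The hard-coded count of 4 distinct X values in the data step could be computed rather than typed in. One possible sketch, assuming the question's dataone data set, uses PROC SQL to store the count in a macro variable; the name n_x is a hypothetical choice.

    ```sas
    /* Count the distinct values of x and store the result in &n_x */
    proc sql noprint;
      select count(distinct x) into :n_x trimmed
      from dataone;
    quit;

    /* Write the count to the log as a quick check */
    %put NOTE: dataone has &n_x distinct values of x;
    ```

    With that in place, `retain have &n_x;` could replace the literal 4 in the data step.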