I have a data set with two variables, x and y. x has four distinct values: 1, 2, 3, and 4. I want to first take a simple random sample (SRS) of size 2 from these 4 unique values and then keep the corresponding rows.
Say I get an SRS of 1 and 2; then I will keep the first 7 rows as a new data set. If I get an SRS of 2 and 3, then I will keep the fifth through eighth rows. Here is a simple example to start with. Thank you.
data dataone;
input x y;
datalines;
1 3
1 4
1 5
1 8
2 3
2 7
2 9
3 2
4 8
4 5
;
run;
You can do this in one of two ways.
PROC SURVEYSELECT will do the work for you if you do it in two steps: first give it a dataset containing just the unique X values, then merge the sampled X values back onto the full dataset by X.
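A minimal sketch of that two-step approach, assuming dataone is already sorted by x (as it is in the question). The dataset names xvalues, xsample, and want and the seed value are illustrative choices, not anything from the question:

```sas
/* Step 1: reduce dataone to one row per distinct x */
proc sort data=dataone(keep=x) out=xvalues nodupkey;
  by x;
run;

/* Step 2: simple random sample of 2 of the distinct x values */
/* (seed=7 is arbitrary and only makes the draw repeatable)   */
proc surveyselect data=xvalues out=xsample
                  method=srs sampsize=2 seed=7 noprint;
run;

/* Step 3: keep only the rows of dataone whose x was sampled */
data want;
  merge dataone(in=inall) xsample(in=insample);
  by x;
  if inall and insample;
run;
```

If dataone is not sorted by x, sort it first or replace the merge with a PROC SQL join.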
Alternatively, you can do this in a single data step, where you first decide whether you're going to take each X value, then keep or drop its rows accordingly.
data want;
set dataone;
by x;
if _n_ = 1 then call streaminit(7); *seed the random number stream once;
retain need 2; *need 2 total;
retain have 4; *have 4 total - this could be determined programmatically;
retain keep; *and these three could all be one retain statement, this is just more readable;
if first.x then do;
if need/have ge rand('Uniform') then keep=1;
else keep=0;
need + -keep;
have + -1;
put need= have= keep=; *log the running counts and the decision for this x;
end;
if keep;
run;
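This is selection sampling (Knuth's Algorithm S), a close relative of reservoir sampling. It gives results identical to an SRS, even though the odds of taking any one X appear to be different: each X is taken with probability need/have given what has already been taken, and that works out to 2/4 for every X overall. For example, the second X is taken with probability 1/3 if the first was taken and 2/3 if it was not, so its overall chance is (1/2)(1/3) + (1/2)(2/3) = 1/2.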
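One way to check the claim that this matches an SRS is to repeat the selection many times and tabulate which pair of X values gets picked. The sketch below is only an illustration: the names sim, rep, take, and picked, the seed, and the repetition count are all made up for this example.

```sas
data sim;
  length picked $4;
  call streaminit(2024);           /* arbitrary seed for repeatability */
  do rep = 1 to 10000;
    need = 2;                      /* still need 2 of the x values */
    have = 4;                      /* 4 x values left to consider  */
    picked = '';
    do x = 1 to 4;
      /* same rule as above: take x with probability need/have */
      take = (need/have ge rand('Uniform'));
      if take then picked = cats(picked, x);
      need = need - take;
      have = have - 1;
    end;
    output;                        /* one row per simulated sample */
  end;
  keep rep picked;
run;

/* Each of the six possible pairs should appear about equally often */
proc freq data=sim;
  tables picked;
run;
```

If the method is equivalent to an SRS, the six pairs (12, 13, 14, 23, 24, 34) should each account for roughly one sixth of the repetitions.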