Search code examples
sassamplingresampling

Create unique label for repeated units with PROC SURVEYSELECT in SAS


I need to resample from a real (cluster) trial data set. So far, I have used the following PROC SURVEYSELECT procedure in SAS to sample 10 clusters from the trial with replacement, with 50% of clusters coming from the control arm and 50% coming from the treatment arm. I repeat this 100 times to get 100 replicates with 10 clusters each and equal allocation.

proc surveyselect data=mydata out=resamples reps=100 sampsize=10 method=urs outhits;
  cluster site;
  strata rx / alloc=(0.5 0.5);
run;

Since I am using unrestricted random sampling (method=urs) to sample with replacement, I specified outhits so that SAS will inform me when a cluster is sampled more than once in each replication.

However, within each replicate in the output resamples dataset, I have not found a way to easily assign a unique identifier to clusters that appear more than once. If a cluster is sampled m times within a replicate, the observations within that cluster are simply repeated m times.

I attempted to use PROC SQL to identify distinct cluster ids and their occurrences within each replication, thinking I could use that to duplicate IDs as appropriate before joining additional data as necessary.

proc sql;
    create table clusterselect as
    select distinct r.replicate, r.site, r.numberhits from resamples as r;
quit;

However, I cannot figure out how to simply replicate rows in SAS.

Any help is appreciate, whether it be modifying PROC SURVEYSELECT to yield a unique cluster id within each replication or repeating cluster IDs as appropriate based on numberhits.

Thank you!


Here's what I've done:

/* 100 resamples with replacement */
proc surveyselect data=mydata out=resamples reps=100 sampsize=10 method=urs outhits;
  cluster site;
  strata rx / alloc=(0.5 0.5);
run;

/* identify unique sites per replicate and their num of appearances (numberhits) */
proc sql;
    create table clusterSelect as
    select distinct r.replicate, r.site, r.numberhits from resamples as r;
quit;

/* for site, repeat according to numberhits */
/* create unique clusterId */
data uniqueIds;
    set clusterSelect;
    do i = 1 to numberhits;
        clusterId = cat(site, i);
        output;
    end;
    drop i numberhits;
run;

/* append data to cluster, retaining unique id */
proc sql;
    create table resDat as
    select
        uid.replicate,
        uid.clusterId,
        uid.site,
        mydata.*
    from uniqueIds as uid
    left join mydata
    on uid.site = mydata.site
quit;

Solution

  • Are you just asking how to convert one observation into the number of observations indicated in the NUMBERHITS variable?

    data want;
      set resamples;
      do _n_=1 to numberhits;
        output;
      end;
    run;