Search code examples
sassas-iml

Creating a Control Group in SAS


How can I create a subset of a dataset in SAS where I want to determine the mean, variance, and number of observations for a variable before hand?

Example:

OBS  NAME  x1 x2
1    Bill  3  2
2    James 4  5
3    Rick  5  6
4    Bob   3  7
5    Clas  5  2
6    Brye  2  9
7    Mann  8  5
8    Pids  4  8
9    Tony  0  7
10   Lou   2  6 

Say I want a subset with 3 observations, mean(x1) = 4, and std(x1) = 0.95. How would I create this subset in SAS?

I would rather not do something using proc means and guess/check repeatedly. Any help is appreciated!

Update: Created a logistic model predicting whether the observations would be in the treatment or control group. Then took the top 10% with the highest probability of being in the treatment group, but were not included in the treatment group and essentially used as the control group.


Solution

  • This is a variation of the knapsack problem. You are trying to find a subset of objects (here, 3 people), such that their properties come close to some specified target values (here, the sum [or mean] and corrected sum of squares [or std dev]). This is also known as a moment-matching problem.

    As stated, the problem is not well defined. You need to specify an objective function to minimize. For example, you might choose the function (mean-target_mean)**2 + (stdDev - target_stdDev)**2 where (mean, stdDev) are the moments for each sampe of size 3.

    For small sets (as in your example), you can do a complete enumeration of the "N choose 3" combinations to figure out which combination to choose. See the article "Generate combinations in SAS" for tips. For example, in SAS/IML, you can solve the stated problem as follows:

    data A;
    length NAME $5.;
    input NAME $ x1 x2;
    datalines;
    Bill  3  2
    James 4  5
    Rick  5  6
    Bob   3  7
    Clas  5  2
    Brye  2  9
    Mann  8  5
    Pids  4  8
    Tony  0  7
    Lou   2  6 
    ;
    
    proc iml;
    use A; read all var {Name x1}; close;
    
    N = nrow(x1);              /* number of obs */          
    k = 3;                     /* size of subset */
    targetMean = 4;
    targetStd = 0.95;
    
    idx = allcomb(N, k);       /* all M='N choose 3' combinations */
    X = shape( x1[idx], nrow(idx) );
    mean = mean(X`);           /* 1 x M vector of sample means */
    std  = std(X`);            /* 1 x M vector of sample std devs */
    objective = (mean - targetMean)##2 + (std - targetStd)##2;
    minVal = objective[><];    /* minimize objective */
    minIdx = objective[>:<];   /* a sample that achieves minimum */
    
    sampNames = Name[idx[minIdx,]];
    sampVals = x1[idx[minIdx,]];
    print sampVals[rowname=sampNames];
    

    Of course, there might be more than one solution. This example has 8 solutions.

    For problems in which there are N items and you want a subset of size k, and for which 'N choose k' is prohibitively large, you can generate random subsets by using the RANCOMB function (or PROC SURVEYSELECT, as someone mentioned). Alternately, you can formulate this problem as an optimization problem. You can use algorithms in SAS/OR or SAS/IML to solve it. For moderate-sized subsets, you could use genetic algorithms in SAS/IML, which are useful for knapsack-like problems.