SAS: how to create a macro program around PROC SURVEYSELECT


I need to draw a simple random sample without replacement of 50k observations from a df of 500k observations.

So I have this dataframe located in lib_d1.df

| Var A | Var B | Var C | Var D | Var E | Var F | Var G | Var H |

I want to create a macro program data_sample(list_var,nb_obs,table_name)

%macro data_sample(list_var, nb_obs, table_name)
data_null_ nobs= n;       
Set lib_d1.df;
call symput("nb_obs", n);
run;

    
proc surveyselect seed=30602 data=&list_var n=&nb_obs out=&table_name;
run;         
%mend        
%let list_var= Var A, Var D, Var E, Var H;
&let nb_obs= &nb_obs;
%let table_name= data_50000;
%data_sample(list_var,nb_obs,table_name);

That's where I'm at, and I have no clue what is right or what is wrong. Any help would be appreciated.


Solution

  • First of all, you might want to consider renaming your variables to comply with SAS naming conventions. I would not recommend working with variables that have spaces in their names. However, if you want to keep your data set as it is right now, try the following:

    %macro data_sample(input=,list_var=,obs=,output=);
    proc surveyselect seed=30602 
                      data=&input.(keep= &list_var.)
                      n=&obs. 
                      out=&output.;
    run;         
    %mend;
       
    %data_sample(input=have,
                list_var='Var A'n 'Var D'n 'Var E'n 'Var H'n,
                obs=50000,
                output=want);
    

    The macro above does not set method=, so PROC SURVEYSELECT falls back to its default, method=srs: simple random sampling, which selects units with equal probability and without replacement. You can add method=srs to make that explicit. See the PROC SURVEYSELECT documentation for more information.
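
    For illustration, here is a minimal sketch with method=srs written out explicitly. It assumes the sashelp.class sample table that ships with SAS (19 rows) and a hypothetical output name, work.class_sample, so it can be run without your own data:

    ```sas
    /* Simple random sampling without replacement, stated explicitly. */
    /* sashelp.class is a built-in demo table; work.class_sample is a */
    /* hypothetical output name chosen for this sketch.               */
    proc surveyselect data=sashelp.class
                      method=srs      /* equal probability, no replacement */
                      n=5             /* number of rows to draw            */
                      seed=30602      /* fixed seed for a repeatable draw  */
                      out=work.class_sample;
    run;
    ```

    Keeping the seed fixed means rerunning the step selects the same rows, which helps when you need to reproduce a sample.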

    By the way, I don't think your first data step would run: data_null_ should be data _null_, and nobs=n is an option of the SET statement, so it has to go on the set statement (after the data set name), not on the data statement. Also, I don't know why you are computing the number of observations in your data set if you only want a fixed sample of 50,000 observations.

    If you do want a macro variable (nbobs in this case) holding the number of observations in your data set, I would suggest the following:

    data _null_;
    if 0 then set lib_d1.df nobs=n;
    call symputx('nbobs',n);
    stop;
    run;
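
    One possible use for such a count, sketched under assumptions: sampling without replacement cannot draw more observations than the table holds, so the count can guard the requested sample size. The macro name check_sample_size and its parameters are hypothetical, and sashelp.class stands in for your own table:

    ```sas
    /* Store the row count of the demo table in the macro variable nbobs. */
    /* The "if 0" condition is never true, so no rows are actually read.  */
    data _null_;
        if 0 then set sashelp.class nobs=n;
        call symputx('nbobs', n);
        stop;
    run;

    /* Hypothetical helper: compare the requested size with the row count. */
    %macro check_sample_size(obs=, total=);
        %if &obs. > &total. %then
            %put WARNING: requested &obs. observations but only &total. are available.;
        %else
            %put NOTE: sampling &obs. of &total. observations.;
    %mend check_sample_size;

    %check_sample_size(obs=5, total=&nbobs.);
    ```

    For a fixed sample of 50,000 out of 500,000 rows this check is not strictly needed, but it could be useful if the macro is reused on tables of unknown size.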