Search code examples
arraysloopssasdo-loopssas-studio

Transforming variables en masse in SAS


I've been searching for a question related but I have had no luck as of yet. I am looking to transform a long list of independent variables for a regression analysis. A dummy dataset would look like this:

DATA TEST (DROP = i);
    DO i = 1 to 4000;
        VAR = i + 100000;
        output;
    end;
run;

PROC TRANSPOSE
    DATA = TEST
    OUT = TEST_T
        (DROP = _NAME_)
    PREFIX = X_;
    ID VAR;
    VAR VAR;
RUN;

DATA TEST_ARRAY;
    SET TEST_T;

    ARRAY X[*] X_:;

    DO J = 1 TO 40;
        DO I = 1 TO DIM(X);
            X[I] = RANUNI(0)*I;
            OUTPUT;
        END;
    END;
RUN;

In this case the variable names X_i are increasing monotonically, in reality, my variables are actually X_number where the number is a six digit unique identifier. I've been trying to log transform and square all these variables such that I have a new X matrix with the following columns

X_133456 X_SQ_133456 LOG_X_133456

I tried looping a list through all variables like this

PROC CONTENTS
    DATA = TEST_ARRAY
    OUT = CONTENTS;
RUN;

PROC SQL NOPRINT;
    SELECT NAME INTO: REG_FACTORS
        SEPARATED BY " "
            FROM CONTENTS;
QUIT;

DATA WANT;
SET TEST_ARRAY;
%LET index = 1;
%DO %UNTIL (%SCAN(&REG_factors.,&index.," ")=);
    %LET factors = %SCAN(&REG_factors.,&index.," ");
    LOG_X_&FACTORS. = LOG(X_&FACTORS.);
    X_SQ_&FACTORS. = (X_&FACTORS.) ** 2;
    %LET index = %EVAL(&Index + 1);
%END;
RUN;

but this blows up my server and I need to find a more efficient way of doing this, thanks in advance

EDIT: for the contributor - I managed to solve at 13:04

%LET input_factors = X_:;

PROC SQL;
    SELECT 
            NAME
        ,   TRANWRD(NAME,%SCAN(&input_factors.,1,'_'),'SQ')
        ,   TRANWRD(NAME,%SCAN(&input_factors.,1,'_'),'LOG')
    INTO        
            :factor_list        separated by " "
        ,   :sq_factor_list     separated by " "
        ,   :log_factor_list    separated by " "
    FROM
        contents
    WHERE
        VARNUM < 5
    WHERE
        NAME LIKE "%SCAN(&input_factors.,1,'_')_"
    ORDER BY
        INPUT(SCAN(NAME,-1,'_'),8.)
    ;
QUIT;

%PUT &factor_list.;
%PUT &sq_factor_list.;
%PUT &log_factor_list.;

Solution

  • Use 3 arrays, one for the input values (e.g. X_31415), and two for the new calculated values (log and square).

    The trick is to dynamically generate the variable names for the calculated variables, based on the original variable names.

    /* Use dictionary table to get/generate vnames */
    proc sql ;
      select name, /* X_31415 */
             tranwrd(name,'X_','X_SQ_'), /* X_SQ_31415 */
             tranwrd(name,'X_','LOG_X_') /* LOG_X_31415 */
        into :VARLIST separated by ' ',
             :SQLIST separated by ' ',
             :LOGLIST separated by ' '
      from dictionary.columns
      where libname = 'WORK'
        and memname = 'MYDATA'
        and name like 'X_%'
      order by input(scan(name,-1,'_'),8.) /* order based on the numeric suffix */
      ;
    quit ;
    

    Now you can assign three arrays, looping over the input values and calculate the square and log accordingly.

    data array3 ;
      set mydata ;
    
      array in{*} &VARLIST ; /* X_1     X_17     X_31415     X_99999     */
      array sq{*} &SQLIST ;  /* X_SQ_1  X_SQ_17  X_SQ_31415  X_SQ_99999  */
      array lg{*} &LOGLIST ; /* LOG_X_1 LOG_X_17 LOG_X_31415 LOG_X_99999 */
    
      do i = 1 to dim(in) ;
        sq{i} = in{i} ** 2 ;
        lg{i} = log(in{i}) ;
      end ;
    
      drop i ;
    run ;