Search code examples
arraysloopssassas-macro

Split SAS datasets by column with primary key


So I have a dataset with one primary key: unique_id and 1200 variables. This dataset is generated from a macro so the number of columns will not be fixed. I need to split this dataset into 4 or more datasets of 250 variables each, and each of these smaller datasets should contain the primary key so that I can merge them back later. Can somebody help me with either a sas function or a macro to solve this? Thanks in advance.


Solution

  • A simple way to split a datasets in the way you request is to use a single data step with multiple output datasets where each one has a KEEP= dataset option listing the variables to keep. For example:

    data split1(keep=Name Age Height) split2(keep=Name Sex Weight);
      set sashelp.class;
    run;
    

    So you need to get the list of variables and group then into sets of 250 or less. Then you can use those groupings to generate code like above. Here is one method using PROC CONTENTS to get the list of variables and CALL EXECUTE() to generate the code.

    I will use macro variables to hold the name of the input dataset, the key variable that needs to be kept on each dataset and maximum number of variables to keep in each dataset.

    So for the example above those macro variable values would be:

    %let ds=sashelp.class;
    %let key=name;
    %let nvars=2;
    

    So use PROC CONTENTS to get the list of variable names:

    proc contents data=&ds noprint out=contents; run;
    

    Now run a data step to split them into groups and generate a member name to use for the new split dataset. Make sure not to include the KEY variable in the list of variables when counting.

    data groups;
      length group 8 memname $41 varnum 8 name $32 ;
      group +1;
      memname=cats('split',group);
      do varnum=1 to &nvars while (not eof);
        set contents(keep=name where=(upcase(name) ne %upcase("&key"))) end=eof;
        output;
      end;
    run;
    

    Now you can use that dataset to drive the generation of the code:

    data _null_;
      set groups end=eof;
      by group;
      if _n_=1 then call execute('data ');
      if first.group then call execute(cats(memname,'(keep=&key'));
      call execute(' '||trim(name));
      if last.group then call execute(') ');
      if eof then call execute(';set &ds;run;');
    run;
    

    Here are results from the SAS log:

    NOTE: CALL EXECUTE generated line.
    1    + data
    2    + split1(keep=name
    3    +  Age
    4    +  Height
    5    + )
    6    + split2(keep=name
    7    +  Sex
    8    +  Weight
    9    + )
    10   + ;set sashelp.class;run;
    
    NOTE: There were 19 observations read from the data set SASHELP.CLASS.
    NOTE: The data set WORK.SPLIT1 has 19 observations and 3 variables.
    NOTE: The data set WORK.SPLIT2 has 19 observations and 3 variables.