Search code examples
sqlsassas-macro

SAS union distinct records from datasets with similar names


I have about 100 large datasets and within each dataset I'm hoping to extract distinct IDs to join them vertically. The datasets are unsorted, named as data_01 , data_02, data_03 ....data_100.

Since the datasets are all very large, set them together without reducing the size is not feasible, the join didn't even move after hours of running. Therefore, I believe there is the need to reduce the datasets before stacking is necessary, and I'm here to seek some help.

I tried to create a macro to select distinct ID and sum a numerical variable,cnt, by ID before vertically joining all datasets by proc sql union. The macro is not working properly:

/*Get dataset names*/
proc sql noprint;
select  memname into :mylist separated by ' '
from dictionary.tables where libname= "mylib" and upcase(memname) like "DATA_%"
;
quit;

%put &mylist;

/*create union statements*/
%global nextdata;
%let nextdata =;
%macro combinedata(mylist);
  data _null_;
       datanum = countw("&mylist");
       call symput('Dataset', put(datanum, 10.));
  run;

   %do i = 1 %to  &Dataset ;
      data _null_;
          temp = scan("&mylist", &i);
          call symput("Dataname", strip(put(temp,$12.)));
      run;
       %put &Dataname;
       %put &Dataset;

        %if (&i=&Dataset) %then %do;
            %let nextdata = &nextdata.
                select id, sum(cnt)
                  from mylib.&&Dataname
                   group by id;
            %end;
        %else %do;
            %let nextdata = &nextdata.
               select id, sum(cnt)
                  from mylib.&&Dataname union
                   group by id;
         %end;

            %put nextdata = &nextdata;
    %end;


%mend combinedata;

%combinedata(&mylist);

/*execute from proc sql*/
proc sql;
    create table combined as (&nextdata);
quit;

I have also attempted to use proc summary, but there was not enough memory to run the following code:

data vneed / view=vneed;
  set data_: (keep=id cnt);
run;
proc summary data=vneed nway;
  class id;
  var cnt;
  output out=want (drop=_type_) sum=sumcnt;
run;

Appreciate any help!


Solution

  • Summarize the data as you go instead of trying to generate one massive query. Then re-aggregate the aggregates.

    proc sql ;
    %do i = 1 %to  &Dataset ;
      %let dataname=mylib.%scan(&mylist,&i,%str( ));
      create table sum&i as 
       select id,sum(cnt) as cnt 
       from &dataname 
       group by id
       order by id
      ;
    %end;
    quit;
    
    data want ;
      do until(last.id);
        set sum1 - sum&dataset ;
        by id;
        sumcnt+cnt;
      end;
      drop cnt;
    run;