I have about 100 large datasets and within each dataset I'm hoping to extract distinct IDs to join them vertically. The datasets are unsorted, named as data_01 , data_02, data_03 ....data_100.
Since the datasets are all very large, set them together without reducing the size is not feasible, the join didn't even move after hours of running. Therefore, I believe there is the need to reduce the datasets before stacking is necessary, and I'm here to seek some help.
I tried to create a macro to select distinct ID and sum a numerical variable,cnt, by ID before vertically joining all datasets by proc sql union. The macro is not working properly:
/*Get dataset names*/
proc sql noprint;
select memname into :mylist separated by ' '
from dictionary.tables where libname= "mylib" and upcase(memname) like "DATA_%"
;
quit;
%put &mylist;
/*create union statements*/
%global nextdata;
%let nextdata =;
%macro combinedata(mylist);
data _null_;
datanum = countw("&mylist");
call symput('Dataset', put(datanum, 10.));
run;
%do i = 1 %to &Dataset ;
data _null_;
temp = scan("&mylist", &i);
call symput("Dataname", strip(put(temp,$12.)));
run;
%put &Dataname;
%put &Dataset;
%if (&i=&Dataset) %then %do;
%let nextdata = &nextdata.
select id, sum(cnt)
from mylib.&&Dataname
group by id;
%end;
%else %do;
%let nextdata = &nextdata.
select id, sum(cnt)
from mylib.&&Dataname union
group by id;
%end;
%put nextdata = &nextdata;
%end;
%mend combinedata;
%combinedata(&mylist);
/*execute from proc sql*/
proc sql;
create table combined as (&nextdata);
quit;
I have also attempted to use proc summary, but there was not enough memory to run the following code:
data vneed / view=vneed;
set data_: (keep=id cnt);
run;
proc summary data=vneed nway;
class id;
var cnt;
output out=want (drop=_type_) sum=sumcnt;
run;
Appreciate any help!
Summarize the data as you go instead of trying to generate one massive query. Then re-aggregate the aggregates.
proc sql ;
%do i = 1 %to &Dataset ;
%let dataname=mylib.%scan(&mylist,&i,%str( ));
create table sum&i as
select id,sum(cnt) as cnt
from &dataname
group by id
order by id
;
%end;
quit;
data want ;
do until(last.id);
set sum1 - sum&dataset ;
by id;
sumcnt+cnt;
end;
drop cnt;
run;