I'm looking for an efficient way to produce sums by week ('weekstart') as a variable ('weekcount') within a dataset of tens of millions of rows. I need it as a variable for further analysis later on, before reducing the dataset. The code below takes too long to run and had to terminate, so I'm wondering if there's a more efficient way of producing these group sums for later use.
proc sql;
select weekstart, count(*) as Weekcount
from pharm
group by weekstart
order by weekstart;
quit;
Thank you
You can use a hash object to compute bin frequency when a very large table is not being processed efficiently by Proc FREQ
or Proc SQL
.
data have;
call streaminit(20230420);
do _n_ = 1 to 2e6;
pid = rand('integer', 1, 2.5e5);
weekstart = intnx ('week', '01jan2015'd + rand('integer', 1, 2500), 1);
output;
end;
format weekstart yymmdd10.;
run;
data weekstart_freq;
weekstart = 0;
declare hash bins(ordered:'a', suminc:'_n_');
bins.defineKey('weekstart');
bins.defineDone();
do while (not lastrow);
set have(keep=weekstart) end=lastrow;
bins.ref(); /* perform bin counting via suminc var */
end;
declare hiter foreachbin('bins');
do while (0=foreachbin.next());
bins.sum(sum:freq); /* retrieve bin count */
output;
end;
stop;
run;
Here is a similar alternative that uses the HASH
keysum:
argument to make the freq
variable automatically part of the .output ()
data set
data _null_ ;
weekstart = 0;
declare hash bins(ordered:'a', suminc:'_n_', keysum:'freq');
bins.defineKey('weekstart');
bins.defineDone();
retain freq 0 ;
do while (not lastrow);
set have(keep=weekstart) end=lastrow;
bins.ref(); /* perform bin counting (freq accumulation) via suminc var */
end ;
bins.output(dataset:'weekstart_freq2') ;
stop ;
run ;