Search code examples
countsasdistinctproc

Count distinct count over a large SAS dataset


I have a SAS dataset for 200million records with Date, Id and amount for year 2022

  Have:
      ID       Date         Amount
     101      1/31/2022      50
     102      1/31/2022      100
     101      2/28/2022      25
     .....................

I am trying to get the unique # of accounts for each month but the it is taking more than an hour to give me the results

    proc sql;
      create table unique as select distinct 
      date,
      count(distinct id) as uniqid
      from Have
      group by date
    quit;

Is there an alternate method to have the results quickly?


Solution

  • One alternate available method is to use proc freq with the nlevels option and access its ods output. The one catch is that your data must be sorted by date. If it is not, you'll need to sort your 200M row table by date which could take much longer than a SQL statement alone. If you have to do this, try keeping only the two variables of interest in a temporary table.

    proc sort data = have(keep=date id)
              out  = have_sorted;
        by date;
    run;
    
    ods select none;
    
    proc freq data=have_sorted nlevels;
       by date;
       tables id;
       ods output nlevels=unique_count(keep   = date nlevels 
                                       rename = (nlevels = uniqid)
                                      );
    run;
    

    Other options include trying to tune your bufno and bufsize system options. Check out this case study from SANrd on how this affects I/O.

    Also consider looking into the %freqyeah macro which parallelizes proc freq: From FREQing Slow to FREQing Fast: Facilitating a Four-Times-Faster FREQ with Divide-and-Conquer Parallel Processing