Search code examples
sqlsasprocproc-sql

How to write proc sql without windowfunction over partition by sum?


I just started learning SAS and realised that proc sql don't use window functions. As I am more at ease with sql I was wondering how I can simulate a sum window function in proc?

desired result

 select a.active, a.store_id, a.nbr, sum(nbr) over (partition by a.store_id)
from(select active, store_id, count(customer_id) as nbr from customer group by active, store_id) as a
;
active store_id nbr sum
0 1 8 326
1 1 318 326
0 2 7 273
1 2 266 273

eg of raw data

select active, store_id, customer_id
from customer
limit 10;
active store_id customer_id
1 1 1
1 1 2
1 2 3
1 2 4
1 1 5
1 1 6
0 1 7
1 2 8
1 1 9
1 2 10

current result and query

    select a.active, a.store_id, a.nbr, sum(nbr) 
    from(select active, store_id, count(customer_id) as nbr from customer group by active, store_id) as a
    group by a.active, a.store_id, a.nbr;
active store_id nbr sum
0 1 8 8
1 1 318 318
0 2 7 7
1 2 266 266

Solution

  • You can do the equivalent in proc sql by merging two sub-queries: one for the count of customers by active, store_id, and another for the total customers for each store_id.

    proc sql noprint;
        create table want as
            select t1.active
                 , t1.store_id
                 , t1.nbr
                 , t2.sum
            from (select active
                       , store_id
                       , count(customer_id) as nbr
                  from have 
                  group by store_id, active
                 ) as t1
    
            LEFT JOIN
    
                 (select store_id
                       , count(customer_id) as sum
                  from have
                  group by store_id
                 ) as t2
    
            ON t1.store_id = t2.store_id
        ;
    quit;
    

    If you wanted to do this in a more SASsy way, you can run proc means and merge together the results from a single dataset that holds everything you need. proc means will calculate all possible combinations of your variables by default.

    proc means data=have noprint;
        class store_id active;
        ways 1 2;
    
        output out=want_total
            n(customer_id) = total
        ;
    run;
    
    data want;
        merge want_total(where=(_TYPE_ = 3) rename=(total = nbr) )
              want_total(where=(_TYPE_ = 2) rename=(total = sum) keep=_TYPE_ store_id total)
        ;
        by store_id;
    
        drop _:;
    run;
    

    Or, in SQL:

    proc sql;
        create table want as
            select t1.store_id
                 , t1.active
                 , t1.total as nbr
                 , t2.total as sum
            from want_total as t1
            LEFT JOIN
                 want_total as t2
            ON t1.store_id = t2.store_id
            where     t1._TYPE_ = 3
                  AND t2._TYPE_ = 2
        ;
    quit;
    

    The _TYPE_ variable identifies the level of the analysis. For example, _TYPE_ = 1 is for active only, _TYPE_ = 2 is for store_id only, and _TYPE_ = 3 is for all combinations. You can view this in the output dataset from proc means:

    store_id    active  _TYPE_  _FREQ_  total
    .           0       1        3      3
    .           1       1        7      7
    1           .       2        6      6
    2           .       2        4      4
    1           0       3        1      1
    1           1       3        5      5
    2           0       3        2      2
    2           1       3        2      2
    

    And if you wanted faster high-performance results, check out its big sibling, proc hpsummary.

    Therein lies the cool thing about SAS: You can bounce between PROCs, SQL, the DATA Step, and Python via Pandas/proc python. You can exploit the unique benefits of each of these methods and thought processes for any number of data engineering and statistics problems.