Search code examples
sassummary

SAS make summary statistic not available in proc mean


I have a table with very many columns but for the in order to explain my problem I will use this simple table.

data test;
    input a b c;
    datalines;
    0 0 0
    1 1 1   
    . 4 2
    ;        
run;

I need to calculate the common summary statistic as min, max and number of missing. But I also need to calculate some special numbers as number of values above a certain level( in this example >0 and >1.

I can use proc mean but it only give me results for normal things like min, max etc.

What I want is result on the following format:

var minval maxval nmiss n_above1 n_above2
a    0     1   1     1        0
b    0     4   0     2        1
c    0     2   0     2        1   

I have been able to make this informat for one variable with this rather stupid code:

data result;
    set test(keep =b) end=last;
    variable = 'b';
    retain minval maxval;
    if _n_ = 1 then do;
        minval = 1e50;
        maxval = -1e50;
    end;
    if minval > b then minval = b;
    if maxval < b then maxval = b;
    if b=. then nmiss+1;
    if b>0 then n_above1+1;
    if b>2 then n_above2+1;
    if last then do;        
        output;
    end;
    drop b;
run;

This produce the following table:

variable minval maxval nmiss n_above1 n_above2
b        0      4      0     2        1

I know there has to be better way do this. I am used to Python and Pandas. There I will only loop through each variable, calculate the different summary statistick and append the result to a new dataframe for each variable.

I can probably also use proc sql. The next example

proc sql;
    create table res as
    select count(case when a > 0 then 1 end) as n_above1_a,
           count(case when b > 0 then 1 end) as n_above1_b,
           count(case when c > 0 then 1 end) as n_above1_c
    from test;
quit;

This gives me:

n_above1_a n_above1_b n_above1_c
1          2          2

But this do not solve my problem.


Solution

  • If you add an unique identifier to each row then you can just use PROC TRANSPOSE and PROC SQL to get your result.

    data test;
      input a b c;
      id+1;
    datalines;
    0 0 0
    1 1 1   
    . 4 2
    ;        
    proc transpose data=test out=tall ;
     by id ;
    run;
    proc sql noprint ;
      create table want as 
        select _name_
             , min(col1) as minval
             , max(col1) as maxval
             , sum(missing(col1)) as nmiss
             , sum(col1>1) as n_above1
             , sum(col1>2) as n_above2
        from tall
        group by _name_
      ;
    quit;
    

    Result

    Obs    _NAME_    minval    maxval    nmiss    n_above1    n_above2
    
     1       a          0         1        1          0           0
     2       b          0         4        0          1           1
     3       c          0         2        0          1           0