Search code examples
sassizeprocbin

SAS: proc hpbin function


The data I have is

Year Score
2020  100
2020  45 
2020  82
.
.
.
2020  91
2020  14
2020  35

And the output I want is

Score_Ranking Count_Percent Cumulative_count_percent Sum
top100        x             y                        z 
101-200
.
.
.
800-900
900-989

The dataset has a total of 989 observations for the same year. I want to divide the whole dataset into 10 bins but set the size to 100. However, if I use the proc hpbin function, my results get divided into 989/10 bins. Is there a way I can determine the bin size?

Also, I want additional rows that show proportion, cumulative proportion, and the sum of the scores. How can I print these next to the bins?

Thank you in advance.


Solution

    1. Sort your data
    2. Classify into bins
    3. Use PROC FREQ for #/Cumulative Count
    4. Use PROC FREQ for SUM by using WEIGHT
    5. Merge results

    Or do 3-4 in same data step.

    I'm not actually sure what the first two columns will tell you as they will all be the same except for the last one.

    First generate some fake data to work with, the sort is important!

    *generate fake data;
    data have;
    do score=1 to 998;
    output;
    end;
    run;
    
    proc sort data=have;
    by score;
    run;
    
    

    Method #1

    Note that I use a view here, not a data set which can help if efficiency may be an issue.

    *create bins;
    data binned / view=binned;
    set have ;
    if mod(_n_, 100) = 1 then bin+1;    
    run;
    
    *calculate counts/percentages;
    proc freq data=binned noprint;
    table bin / out=binned_counts outcum;
    run;
    
    *calculate sums - not addition of WEIGHT;
    proc freq data=binned noprint;
    table bin / out=binned_sum outcum;
    weight score;
    run;
    
    *merge results together;
    data want_merged;
    merge binned_counts binned_sum (keep = bin count rename = count= sum);
    by bin;
    run;
    

    Method #2

    And another method, which requires a single pass of your data rather than multiple as in the PROC FREQ approach:

    *manual approach;
    data want;
    set have 
        nobs = _nobs /*Total number of observations in data set*/ 
        End=last /*flag for last record*/;
        
    *holds values across rows and sets initial value;   
    retain bin 1 count cum_count cum_sum 0 percent cum_percent ;
    
    *increments bins and resets count at start of each 100;
    if mod(_n_, 100) = 1 and _n_ ne 1 then do;
        *output only when end of bin;
        output;
        bin+1;
        count=0;    
    end;
    
    *increment counters and calculate percents;
    count+1;
    percent = count / _nobs;
    cum_count + 1;
    cum_percent = cum_count / _nobs;
    cum_sum + score;
    
    *output last record/final stats;
    if last then output;
    
    *format percents;
    format percent cum_percent percent12.1;
    
    run;