Search code examples
bashsortingawkpercentile

Compute percentile and max value per variable


Bash Gurus, I need to compute the max and percentile numbers for each item in the list, using awk

aa  1
ab  3
aa  4
ac  5
aa  3
ad  2
ab  4
ac  2
ae  2
ac  5

Expected output

Item   90th percentile   max value
aa     3.8             4
ab     3.9             4
ac     5               5
ad     2               2
ae     2               2

Am able to get the sum and max using the below, but not the percentile.

awk '{
item[$1]++;
count[$1]+=$2;
max[$1]=$2;
percentile[$1,.9]=$2
 }
 END{
 for (var in item)
 print var,count[var],max[var],percentile[var]
 }
' 

Please suggest.


Solution

  • Percentile calculation from Statistics for Dummies 2nd ed. :). In Gnu awk:

    $ cat mnp.awk
    BEGIN {
        PROCINFO["sorted_in"]="@ind_num_asc"   # for order in output
        if(p=="")                              # if p not defined it's median
            p=0.5
        else
            p=p/100                            # if 90th percentile: p=0.9
    }
    {
        v[$1][NR]=$2                           # values stored per keyword. NR for unique
        if($2>m[$1])                           # find max val
            m[$1]=$2
    }
    END {
        for(i in v) {                          # for all keywords
            n=asort(v[i])                      # sort values, n is count
            prc=p*n;                           # percentile figuration
            if(prc==int(prc))
                w=(v[i][prc]+v[i][prc+1])/2
            else
                w=v[i][int(prc)+1]
            print i, m[i], w                   # print keyword, max and nth value
        }
    }
    

    Run it:

    $ awk -p=90 -f mnp.awk data.txt
    aa 4 4
    ab 4 4
    ac 5 5
    ad 2 2
    ae 2 2
    

    TODO: if the data file was sorted, this could be streamlined and not all data would need to be stored to memory.