Bash Gurus, I need to compute the max and percentile numbers for each item in the list, using awk
aa 1
ab 3
aa 4
ac 5
aa 3
ad 2
ab 4
ac 2
ae 2
ac 5
Expected output
Item 90th percentile max value
aa 3.8 4
ab 3.9 4
ac 5 5
ad 2 2
ae 2 2
Am able to get the sum and max using the below, but not the percentile.
awk '{
item[$1]++;
count[$1]+=$2;
max[$1]=$2;
percentile[$1,.9]=$2
}
END{
for (var in item)
print var,count[var],max[var],percentile[var]
}
'
Please suggest.
Percentile calculation from Statistics for Dummies 2nd ed. :). In Gnu awk:
$ cat mnp.awk
BEGIN {
PROCINFO["sorted_in"]="@ind_num_asc" # for order in output
if(p=="") # if p not defined it's median
p=0.5
else
p=p/100 # if 90th percentile: p=0.9
}
{
v[$1][NR]=$2 # values stored per keyword. NR for unique
if($2>m[$1]) # find max val
m[$1]=$2
}
END {
for(i in v) { # for all keywords
n=asort(v[i]) # sort values, n is count
prc=p*n; # percentile figuration
if(prc==int(prc))
w=(v[i][prc]+v[i][prc+1])/2
else
w=v[i][int(prc)+1]
print i, m[i], w # print keyword, max and nth value
}
}
Run it:
$ awk -p=90 -f mnp.awk data.txt
aa 4 4
ab 4 4
ac 5 5
ad 2 2
ae 2 2
TODO: if the data file was sorted, this could be streamlined and not all data would need to be stored to memory.