Search code examples
awkmergeaveragemedian

average from same rows in one columns


I would like to create awk code, which could find same number in $4th column for all rows and count median and average for $4,$5 column and merge it to one row.

INPUT:

     chr10  89623184    89624315    1   0.2062732989
     chr10  89623184    89624315    1   0.2155829606
     chr10  89623184    89624315    1   0.2471277767
     chr10  89623184    89624315    2   0.2236894639
     chr10  89623184    89624315    2   0.2291949506
     chr10  89623184    89624315    2   0.2252482173

OUTPUT:

                             MEAN   MEDIAN         MEAN

chr10 89623184    89624315    1   0.2155829606  0.2229946787 
chr10 89623184    89624315    2   0.2252482173  0.2260442106

thank you for help.


Solution

  • awk to the rescue!

      awk '{
        k = $1 FS $2 FS $3 FS $4
        s[k] += $5
        v[k] = k in v ? v[k] FS $5 : $5
      }        
      END {
        for(i=1;i<=length(k);i++) printf "%s"," "
        printf " %s       %s\n","MEDIAN","MEAN"
        for (k in s) {
          n = split(v[k], m, FS)
          asort(m)
          median = n%2==0 ? (m[n/2]+m[n/2+1])/2 : m[(n-1)/2]
          print k, median, s[k]/n
        }
      }' file
    
    
                              MEDIAN       MEAN
    chr10 89623184 89624315 1 0.2062732989 0.222995
    chr10 89623184 89624315 2 0.2236894639 0.226044
    

    your key needs to be the first four fields, not just 4. For median calculation if there are even number of observations computes the average of mid two.