Search code examples
awkmean

Get the mean of comma-separated values using AWK


I have the following data, separated by spaces:

SampleX=1,1,2,3 SampleY=1,1,2,3 SampleZ=1,1,2,3 
SampleX=0,0,0,1 SampleY=.,.,.,. SampleZ=0,1,1,1

I want to do a summary of each row, getting the mean for each value in all samples, ignoring dots. So first row, there are 3 samples, each sample contains 4 values. The mean of each value are represented as follows:

A B C D
1 1 2 3
0 0.5 0.5 1

I already did something similar but instead of mean, I got counts as follows:

SampleX=1/1 SampleY=0/0 SampleZ=1/1 
SampleX=0/1 SampleY=./. SampleZ=1/0

Result is count of events (0/0=A, 1/0 or 0/1=B and 1/1=C)

A B C
1 0 2
0 2 0

Using:

    echo "SampleA=1/1 SampleB=0/0 SampleC=1/1 
SampleA=0/1 SampleB=./. SampleC=1/0" | awk 'BEGIN {OFS="\t"; print "A\tB\tC"} {B=gsub(/0\|1|1\|0|0\/1|1\/0/, ""); A=sub(/1\|1|1\/1/, ""); C=gsub(/0\|0|0\/0/, ""); print A,B,C}'

But I'm not sure how to do the same with commas and get the mean instead of counts.


Solution

  • Using any awk:

    $ cat tst.awk
    {
        delete tot
        delete cnt
        for ( i=1; i<=NF; i++ ) {
            n = split($i,f,/[=,]/)
            for ( j=2; j<=n; j++ ) {
                if ( f[j] != "." ) {
                    tot[j] += f[j]
                    cnt[j] ++
                }
            }
        }
        if ( NR == 1 ) {
            for ( j=2; j<=n; j++ ) {
                val = substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",j-1,1)
                printf "%s%s", val, (j<n ? OFS : ORS)
            }
        }
        for ( j=2; j<=n; j++ ) {
            val = ( cnt[j] ? tot[j] / cnt[j] : 0 )
            printf "%s%s", val, (j<n ? OFS : ORS)
        }
    }
    

    $ awk -f tst.awk file
    A B C D
    1 1 2 3
    0 0.5 0.5 1