I have the following data, separated by spaces:
SampleX=1,1,2,3 SampleY=1,1,2,3 SampleZ=1,1,2,3
SampleX=0,0,0,1 SampleY=.,.,.,. SampleZ=0,1,1,1
I want to do a summary of each row, getting the mean for each value in all samples, ignoring dots. So first row, there are 3 samples, each sample contains 4 values. The mean of each value are represented as follows:
A B C D
1 1 2 3
0 0.5 0.5 1
I already did something similar but instead of mean, I got counts as follows:
SampleX=1/1 SampleY=0/0 SampleZ=1/1
SampleX=0/1 SampleY=./. SampleZ=1/0
Result is count of events (0/0=A, 1/0 or 0/1=B and 1/1=C)
A B C
1 0 2
0 2 0
Using:
echo "SampleA=1/1 SampleB=0/0 SampleC=1/1
SampleA=0/1 SampleB=./. SampleC=1/0" | awk 'BEGIN {OFS="\t"; print "A\tB\tC"} {B=gsub(/0\|1|1\|0|0\/1|1\/0/, ""); A=sub(/1\|1|1\/1/, ""); C=gsub(/0\|0|0\/0/, ""); print A,B,C}'
But I'm not sure how to do the same with commas and get the mean instead of counts.
Using any awk:
$ cat tst.awk
{
delete tot
delete cnt
for ( i=1; i<=NF; i++ ) {
n = split($i,f,/[=,]/)
for ( j=2; j<=n; j++ ) {
if ( f[j] != "." ) {
tot[j] += f[j]
cnt[j] ++
}
}
}
if ( NR == 1 ) {
for ( j=2; j<=n; j++ ) {
val = substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",j-1,1)
printf "%s%s", val, (j<n ? OFS : ORS)
}
}
for ( j=2; j<=n; j++ ) {
val = ( cnt[j] ? tot[j] / cnt[j] : 0 )
printf "%s%s", val, (j<n ? OFS : ORS)
}
}
$ awk -f tst.awk file
A B C D
1 1 2 3
0 0.5 0.5 1