bash awk average mean

AWK print name of file and function result


I have a file with millions of entries in a column and for this reason I'm using awk, which is the fastest method I know for these calculations. I need to calculate the mean of values in a column and I have done it this way:

allsamples="R3 SM261_T SM382_T R6"

for sample in $allsamples
do
awk 'BEGIN {print "ID","Coverage"}; {sum+=$2} END { print "Average = ",sum/NR}' $sample.dep > $sample.mean_coverage.temp >> All_samples_coverage.txt
done

The script works correctly and prints the headers I need but I also need to print the filename next to the mean value.

I have tried this:

awk 'BEGIN {print "ID","Coverage"}; {print FILENAME} {sum+=$2} END {print "Average = ",sum/NR}'

but it prints the filename once for every line of the original file (so if R3.dep has 60 million lines, the filename is printed 60 million times, followed by the function result).

Example file would be:

Locus   Total_Depth Average_Depth_sample    Depth_for_R3
chr1:10001  4   4.00    4
chr1:10002  5   5.00    5
chr1:10003  7   7.00    7
chr1:10004  9   9.00    9

What I get is:

ID Coverage
R3.txt
R3.txt
R3.txt
R3.txt
R3.txt
Average =  5

What I would need is:

ID Coverage
R3.txt Average =  5

Any suggestions as to what I'm doing wrong?


Solution

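  • The filename is printed for every line because a bare {print FILENAME} block has no pattern, so awk runs it on each input record; anything that should appear once per file belongs in the END block. Likewise, BEGIN runs once per awk invocation, so inside the loop the header is printed again for every sample. From what you describe, the header is shared by all the files, so I believe it should not be part of the AWK statement at all: a single bash echo before the loop is enough. I would also put the "Average" label in that header and remove it from the printf command shown below.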

    Your AWK statement should then become

    awk 'BEGIN{
        sum=0 ;
    }{
        sum+=$2 ;
    }END{
        # Variant that keeps the label on every line:
        #printf("%10s:  Average = %s\n", FILENAME, sum/NR ) ;
        printf("%10s:  %s\n", FILENAME, sum/NR ) ;
    }' "$sample.dep" >> All_samples_coverage.txt
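
Putting the pieces together, the whole loop might look roughly like the sketch below. It is only an illustration under a few assumptions: mean_coverage is a hypothetical helper name that does not appear in the question, the two small .dep files are created on the spot so the example runs by itself (the R6 values are made up), and every input file is assumed to start with exactly one header row, which NR > 1 skips.

```shell
#!/usr/bin/env bash
# mean_coverage is a hypothetical helper name, not from the original post.
# It prints "<file> <mean of column 2>" for a single file.
mean_coverage() {
    # NR > 1 skips the header row (assumption: exactly one header line per file).
    # n counts only the data rows, so the header does not dilute the mean.
    awk 'NR > 1 { sum += $2; n++ }
         END    { if (n > 0) print FILENAME, sum / n }' "$1"
}

# Demo data so the sketch runs on its own; real .dep files would already exist.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'Locus\tTotal_Depth\nchr1:10001\t4\nchr1:10002\t5\nchr1:10003\t7\nchr1:10004\t9\n' > R3.dep
printf 'Locus\tTotal_Depth\nchr1:10001\t2\nchr1:10002\t4\n' > R6.dep   # made-up values

allsamples="R3 R6"
out=All_samples_coverage.txt

echo "ID Coverage" > "$out"        # header written once, before the loop
for sample in $allsamples
do
    mean_coverage "$sample.dep" >> "$out"
done

cat "$out"
```

With the demo data this writes the header once, followed by "R3.dep 6.25" and "R6.dep 3". Note that 6.25 differs from the 5 shown in the question: there the header line was counted in NR and its non-numeric second field was added as 0, giving 25/5 instead of 25/4. If your files have no header row, drop the NR > 1 condition.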