Search code examples
bashawkaveragestandard-deviation

Standard Deviation from multiple files in bash


I wish to calculate the standard deviation from a range of files titled "res_NUMBER.cs" which are formatted as a CSV. Example data includes

1,M,CA,54.9130  
1,M,CA,54.9531  
1,M,CA,54.8845  
1,M,CA,54.7517  
1,M,CA,54.8425  
1,M,CA,55.2648  
1,M,CA,55.0876 

I have calculated the mean using

#!/bin/bash


files=`ls res*.cs`  
for f in $files; do 
        echo "$f" 
        echo " " 
        #Count number of lines N 
        lines=`cat $f | wc -l` 
        #Sum Total 
        sum=`cat $f | awk -F "," '{print $4}' | paste -sd+ | bc` 
        #Mean 
        mean=`echo "scale=5 ; $sum / $lines" | bc` 
        echo "$mean" 
        echo " "

I would like to calculate the standard deviation across each file. I understand that the standard deviation formula is

S.D=sqrt((1/N)*(sum of (value - mean)^2))

But I am unsure how I would implement this into my script.


Solution

  • awk is powerful enough to calculate the mean of one file easily

    $ awk -F, '{sum+=$4} END{print sum/NR}' file
    

    to add standard deviation (not that your formula is for population, not for sample, that's what I replicate here)

     $ awk -F, '{sum+=$4; ss+=$4^2} END{print m=sum/NR,sqrt(ss/NR-m^2)}' file
     54.9567 0.15778
    

    this uses the fact that stddev = sqrt(Var(x)) = sqrt( E(x^2) - E(x)^2 ) which has worse numerical accuracy (since squaring the values instead of diff) but works fine if your values have low bounds.

    The simplest is then using this in a for loop for the files

    for f in res*.cs
    do 
        awk -F, '{sum+=$4; ss+=$4^2} 
             END {print FILENAME; 
                  print "mean:", m=sum/NR, "stddev:", sqrt(ss/NR-m^2)}' "$f"
    end
    

    to run res1.cs .. res37.cs in that order, easiest is change the for loop

    for f in res{1..37}.cs
    # the rest of the code not changed.
    

    which will expand in the numerical order specified.