I wish to calculate the standard deviation for each of a range of files titled "res_NUMBER.cs", which are formatted as CSV. Example data:
1,M,CA,54.9130
1,M,CA,54.9531
1,M,CA,54.8845
1,M,CA,54.7517
1,M,CA,54.8425
1,M,CA,55.2648
1,M,CA,55.0876
I have calculated the mean using
#!/bin/bash
files=`ls res*.cs`
for f in $files; do
echo "$f"
echo " "
#Count number of lines N
lines=`cat $f | wc -l`
#Sum Total
sum=`cat $f | awk -F "," '{print $4}' | paste -sd+ | bc`
#Mean
mean=`echo "scale=5 ; $sum / $lines" | bc`
echo "$mean"
echo " "
done
I would like to calculate the standard deviation across each file. I understand that the standard deviation formula is
S.D=sqrt((1/N)*(sum of (value - mean)^2))
But I am unsure how I would implement this into my script.
awk is powerful enough to calculate the mean of one file easily:
$ awk -F, '{sum+=$4} END{print sum/NR}' file
To add the standard deviation (note that your formula is for the population, not for a sample; that is what I replicate here):
$ awk -F, '{sum+=$4; ss+=$4^2} END{print m=sum/NR,sqrt(ss/NR-m^2)}' file
54.9567 0.15778
This uses the fact that stddev = sqrt(Var(x)) = sqrt( E(x^2) - E(x)^2 ), which has worse numerical accuracy (it squares the raw values instead of their differences from the mean) but works fine if your values are small in magnitude.
The simplest approach is then to use this in a for loop over the files:
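If accuracy is a concern, a two-pass variant follows the formula from the question directly: first compute the mean, then sum the squared differences from it. This is only a sketch; the function name stddev_two_pass is made up for illustration, and it assumes the value is in column 4 and that the file can be read twice (so it will not work on a pipe).

```shell
#!/bin/bash
# Hypothetical helper: population standard deviation in two passes.
# The same file is given to awk twice; NR == FNR is true only while
# awk is still reading the first copy.
stddev_two_pass() {
    awk -F, '
        NR == FNR { sum += $4; n++; next }      # pass 1: sum and count
        FNR == 1  { mean = sum / n }            # start of pass 2: fix the mean
        { d = $4 - mean; ss += d * d }          # pass 2: squared differences
        END { if (n > 0) printf "%.4f %.5f\n", mean, sqrt(ss / n) }
    ' "$1" "$1"
}
```

Called as stddev_two_pass file on the sample data from the question, this prints 54.9567 0.15778, the same as the one-pass version. To get the sample standard deviation instead, divide by n - 1 rather than n.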
for f in res*.cs
do
awk -F, '{sum+=$4; ss+=$4^2}
END {print FILENAME;
print "mean:", m=sum/NR, "stddev:", sqrt(ss/NR-m^2)}' "$f"
done
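A plain res*.cs glob sorts lexically, so res10.cs comes before res2.cs. To run res1.cs .. res37.cs in that order, the easiest way is to change the for loop (brace expansion is a bash feature):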
for f in res{1..37}.cs
# the rest of the code not changed.
which will expand in the numerical order specified.
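As an alternative to the shell loop, a single awk process can handle all the files by resetting its accumulators whenever a new file starts (FNR == 1). This is a sketch under the same assumptions as above (comma-separated, value in column 4, population formula); the names report_stats, report and name are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: per-file mean and population stddev in one awk call.
report_stats() {
    awk -F, '
        # Print the statistics gathered for the file just finished.
        function report() {
            if (n > 0) {
                m = sum / n
                v = ss / n - m * m
                if (v < 0) v = 0        # guard against tiny negative rounding error
                printf "%s mean: %.4f stddev: %.5f\n", name, m, sqrt(v)
            }
        }
        FNR == 1 { report(); name = FILENAME; n = sum = ss = 0 }
        { n++; sum += $4; ss += $4 * $4 }
        END { report() }
    ' "$@"
}
```

It could be called as report_stats res{1..37}.cs to keep the numerical order. Empty files are skipped silently, since they never trigger FNR == 1.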