I have this command line that calculates the average, standard deviation, median, minimum and maximum of a column.
cut -f3 "$file" | sort -n | awk '
  $1 ~ /^-?[0-9]+(\.[0-9]*)?$/ { a[c++] = $1; sum += $1; sumsq += $1 * $1 }
  END {
    ave = sum / c
    if (c % 2 == 1) median = a[int(c / 2)]
    else            median = (a[c / 2] + a[c / 2 - 1]) / 2
    OFS = "\t"
    sd = sqrt(sumsq / c - ave * ave)
    print ave, sd, median, a[0], a[c - 1]
  }' >> "$output"
Each file has millions of lines, which should add up to exactly 772,474,283. But many files fall short, and when they do, the missing lines bias the statistics I want to calculate. I should therefore append 0 values (after having cut the third column) so that the total line count reaches that number and the zeros are taken into account when calculating the average, sd, etc.
I guess I should count the lines of my file with wc -l and then append X zeroes, where X = 772,474,283 minus that line count? How can I do that?
Or do you know a more elegant way?
Thanks!
M
calculate line number on my file with wc -l and then appends X zeroes, with X=772,474,283-line number? How can I do that?
Something like this:
min=772474283 # a billion lines really?
cut3min() {
    [ $# -lt 1 ] && return
    f="$1"
    # pass the third column through unchanged
    cut -f3 "$f"
    # then pad with zeros up to $min lines
    n=$(wc -l <"$f")
    [ "$n" -ge "$min" ] && return
    for ((i = n; i < min; i++)); do echo 0; done
}
# cut -f3 "$file" | ...
cut3min "$file" | ...
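If you have GNU coreutils, the echo loop can be avoided by streaming the zeros with yes(1) and head(1), which is far faster than iterating in the shell. A sketch, with the hypothetical name cut3min_fast:

```shell
min=772474283
# cut3min_fast: like cut3min, but emits the padding zeros as a single
# yes|head pipeline instead of a shell loop (assumes GNU coreutils)
cut3min_fast() {
    [ $# -lt 1 ] && return
    f="$1"
    cut -f3 "$f"
    n=$(wc -l <"$f")
    [ "$n" -ge "$min" ] && return
    yes 0 | head -n "$((min - n))"   # the missing zeros, in one pipeline
}
```

Usage is identical: cut3min_fast "$file" | ...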
But it's shell, so it will be very slow if it has to emit hundreds of millions of lines that way.
That simple loop plus the maths is better written in a compiled language like C.
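For the average and standard deviation, the zeros never need to exist at all: each padded line contributes 0 to both sum and sumsq, so you can keep the original data stream and simply divide by the fixed total N in the END block. A sketch with a hypothetical helper stats_padded (the median, min and max are omitted here, since they depend on where the zeros fall in the sorted order and need separate handling):

```shell
# stats_padded FILE N -- mean and sd of column 3, as if FILE had been
# padded with zeros up to N lines, without ever generating the zeros
stats_padded() {
    file=$1; N=$2
    cut -f3 "$file" | awk -v N="$N" '
      $1 ~ /^-?[0-9]+(\.[0-9]*)?$/ { sum += $1; sumsq += $1 * $1 }
      END {
        ave = sum / N                      # padded lines add 0 to sum
        sd  = sqrt(sumsq / N - ave * ave)  # and 0 to sumsq
        print ave, sd
      }'
}
```

This streams the file once, with no sort and no padding, so it scales to hundreds of millions of lines without a compiled helper.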