
Change bash script to append 0s to reach a certain line number


I have this command line that calculates average, standard deviation, median, min value and max value.

cut -f3 $file | sort -n | awk 'BEGIN {c = 0; sum = 0;} $1 ~ /^(\-)?[0-9]*(\.[0-9]*)?$/ {a[c++] = $1; sum += $1; sumsq+=$1*$1} END {ave = sum / c; if( (c % 2) == 1 ) { median = a[ int(c/2) ]; } else {median = ( a[c/2] + a[c/2-1] ) / 2; } OFS="\t"; sd = sqrt(sumsq/c - (sum/c)**2); print ave, sd, median, a[0], a[c-1]; }' >> $output

Each file has millions of lines, which should add up to exactly 772,474,283. Many files fall short, though, and when they do, this biases the statistics I want to calculate. I should therefore add 0 values (after cutting the third column) so that the total line count reaches this number and the zeros are taken into account when calculating the average, sd, etc.

I guess I should get the line count of my file with wc -l and then append X zeros, with X = 772,474,283 - line count? How can I do that?

Or do you know a more elegant way?

Thanks!

M


Solution

  • get the line count of my file with wc -l and then append X zeros, with X = 772,474,283 - line count? How can I do that?

    Something like this:

    min=772474283    # a billion lines really?
    
    cut3min() {
        [ $# -lt 1 ] && return
        f="$1"; shift
        cut -f3 "$f"
        n=$(wc -l <"$f")
        [ "$n" -ge "$min" ] && return
        for ((i=n; i<min; i++)); do echo 0; done
    }
    
    # cut -f3 "$file" | ...
    cut3min "$file" | ...
    

    But this is a shell loop, so it will be very slow when it has to emit hundreds of millions of lines.

    A simple loop plus arithmetic like this is better written in a compiled language like C.
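    If the per-line shell loop is the bottleneck, one possible shortcut is to let `yes` and `head` generate the padding instead. The sketch below is only an illustration, not a benchmarked solution: the function name `cut3pad` and its second argument (the target line count) are hypothetical, and it assumes a tab-separated input file with a trailing newline, as in the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print column 3 of a tab-separated file, then pad
# the output with 0s until at least "target" lines have been printed.
cut3pad() {
    local f=$1 target=$2 n
    cut -f3 "$f"
    n=$(wc -l < "$f")              # number of lines already printed
    if (( n < target )); then
        # yes repeats "0" forever; head stops it after the missing count
        yes 0 | head -n "$(( target - n ))"
    fi
}
```

    It would slot into the original pipeline in place of `cut -f3`, for example `cut3pad "$file" 772474283 | sort -n | awk '...'`. The file is still read twice (once by `cut`, once by `wc`), as in the loop version above.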