How to use the value in a file as input for a calculation in awk - in bash?

I'm trying to check whether the count in each row exceeds a certain value: 30% of the total count.

Inside a for loop, I compute that threshold with awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value; the output file contains only that single number.

How do I make the calculation "value is greater than" for each row of ${i}_counts against ${i}_percentage-value? In other words, how to use the number inside the file as a numerical value for a math operation?


data.csv (an extract)

SampleID    ASV    Count
1000A   ASV_1216    14
1000A   ASV_12580   150
1000A   ASV_12691   260
1000A   ASV_135     434
1000A   ASV_147     79
1000A   ASV_15      287
1000A   ASV_16      361
1000A   ASV_184     8
1000A   ASV_19      42



Each sample ID has many ASVs, and the number varies a lot between samples (for example, 50 ASVs for 1000A, 120 for 1000B, and so on). Every ASV_## has a count. My code computes the total count per sample, derives the 30% threshold for that sample, and should report which ASV_## exceeds it. Ultimately, it should report 0 for counts below 30% and 1 for counts above 30%.

Here's my code so far:

    for i in $(cat samplesID-short); do
    grep ${i} data.csv | cut -d , -f3 - > ${i}_count_sample
    grep ${i} data.csv | cut -d , -f2 - > ${i}_ASV
    awk '{ sum += $1; } END { print sum; }' ${i}_count_sample > ${i}_counts
    awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value

#I was thinking about replicating the numeric value down the entire column and then making the "greater than" comparison, but the number of repetitions depends on the ASV count of each sample, which is always different.

    wc -l ${i}_ASV > n
    for (( c=1; c<=n; c++)) ; do echo ${i}_percentage-value ; done

    paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_percentage-value > ${i}_tmp; 
    awk 'BEGIN{OFS="\t"}{if($2 >= $3) print $1}' ${i}_tmp > ${i}_is30;

#How the output should be:

    paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_is30 > ${i}_summary_nh
    echo -e "ASV_ID\tASV_in_sample\ttotal_ASVs_inSample\ttreshold_for_30%\tASV_over30%" | cat - ${i}_summary_nh > ${i}_summary
    rm ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_ASV ${i}_summary_nh ${i}_is30
    done &


  • Would you please try the following:

    awk -v OFS="\t" '
        NR==FNR {   # this block is executed in the 1st pass only
            if (FNR > 1) sum[$1] += $3
                    # accumulate the "count" for each "SampleID"
            next
                    # skip the 2nd block while reading the 1st file
        }
                    # the following block is executed in the 2nd pass only
        FNR > 1 {   # skip the header line
            if ($1 != prev_id) {
                    # SampleID has changed: update the output filename and print the header line
                if (outfile) close(outfile)
                    # close previous outfile
                outfile = $1 "_summary"
                print "ASV_ID", "ASV_in_sample", "total_ASVs_inSample", "treshold_for_30%", "ASV_over30%" >> outfile
                prev_id = $1
            }
            mark = ($3 > sum[$1] * 0.3) ? 1 : 0
                    # set the mark to "1" if the "Count" exceeds 30% of sum
            print $2, $3, sum[$1], sum[$1] * 0.3, mark >> outfile
                    # append the line to the summary file
        }
    ' data.csv data.csv


    To compute an aggregate such as a sum or an average, we have to read the input all the way to the end. If we want to print each input record together with that aggregate (or with information derived from it), we need one of two tricks:

    • Store all input records in memory.
    • Read the input data twice.
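
    For comparison, the first option (keeping the records in memory) might look roughly like the sketch below. The file names sample.tsv and sample_marked.tsv and the three data rows are made up for illustration, and the input is assumed to be whitespace-separated with one header line, like the extract in the question.

```shell
# Hypothetical sample input (not the real data.csv): header plus three rows
cat > sample.tsv <<'EOF'
SampleID ASV Count
1000A ASV_1 10
1000A ASV_2 70
1000A ASV_3 20
EOF

awk -v OFS="\t" '
    FNR > 1 {                       # skip the header line
        n++
        id[n] = $1; asv[n] = $2; count[n] = $3   # keep every record in memory
        sum[$1] += $3               # running total per SampleID
    }
    END {                           # totals are complete only here
        for (i = 1; i <= n; i++) {  # replay the stored records in input order
            threshold = sum[id[i]] * 0.3
            mark = (count[i] > threshold) ? 1 : 0
            print id[i], asv[i], count[i], threshold, mark
        }
    }
' sample.tsv > sample_marked.tsv

cat sample_marked.tsv
```

    With these made-up rows the total is 100 and the threshold is 30, so only ASV_2 gets a 1. The trade-off is memory: every record stays in the arrays until END, which is fine for small tables but may not be for very large ones.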

    Since awk makes it easy to read several files and to change the procedure depending on which file is being read, I have picked the 2nd method.

    • The condition NR==FNR is true only while reading the 1st file. Within this block we accumulate the sum of the Count field for each SampleID as the 1st pass.
    • The next statement at the end of the block skips the rest of the script for that line.
    • Once the 1st file is done, the script reads the 2nd file, which is of course the same file as the 1st.
    • While reading the 2nd file, the condition NR==FNR is no longer true, so the 1st block is skipped.
    • The 2nd block reads the input again line by line, opens an output file for each SampleID, and appends the information obtained in the 1st pass, such as the per-sample sum and the 30% threshold.
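
    The same NR==FNR idiom also covers the literal question of using a number stored in a file as a numeric value. A minimal sketch, assuming the threshold file holds a single number on one line; threshold.txt, counts.txt and marks.txt are hypothetical stand-ins for ${i}_percentage-value, ${i}_count_sample and an output file:

```shell
# Hypothetical stand-ins for the per-sample files from the question
printf '30\n' > threshold.txt
printf '10\n70\n20\n' > counts.txt

awk '
    NR == FNR { threshold = $1; next }   # 1st file: remember the single number
    {                                    # 2nd file: compare each count with it
        mark = ($1 > threshold) ? 1 : 0
        print $1, mark
    }
' threshold.txt counts.txt > marks.txt

cat marks.txt
```

    Here only the 70 is marked with a 1. Note that NR==FNR stays true for the 2nd file if the 1st file is empty, so this sketch assumes the threshold file is never empty. Passing the value with -v, as in awk -v threshold="$(cat threshold.txt)" '...' counts.txt, should work as well and avoids that corner case.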