Search code examples
bashfor-loopawkcycle

How to use the value in a file as input for a calculation in awk - in bash?


I'm trying to calculate if the count for each row is more than a certain value, 30% of the total counts.

Within a for cycle, I've obtained the percentage in awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value and that's a single number, the output only contains that.

How do I make the calculation "value is greater than" for each row of ${i}_counts against ${i}_percentage-value? In other words, how to use the number inside the file as a numerical value for a math operation?

Data:

data.csv (an extract)

SampleID    ASV    Count
1000A   ASV_1216    14
1000A   ASV_12580   150
1000A   ASV_12691   260
1000A   ASV_135     434
1000A   ASV_147     79
1000A   ASV_15      287
1000A   ASV_16      361
1000A   ASV_184     8
1000A   ASV_19      42

samples-ID-short

1000A
1000B
1000C

So for each sample ID, there's a lot of ASV, a quantity that may vary a lot like 50 ASV for 1000A, 120 for 1000B and so on. Every ASV_## has a count and my code is for calculating the count total sum, then finding out which is the 30% value for each sample, report which ASV_## is greater than 30%. Ultimately, it should report a 0 for <30% and 1 for >30%.

Here's my code so far:

    for i in $(cat samplesID-short)
    do
    grep ${i} data.csv | cut -d , -f3 - > ${i}_count_sample
    grep ${i} data.csv | cut -d , -f2 - > ${i}_ASV
    awk '{ sum += $1; } END { print sum; }' ${i}_count_sample > ${i}_counts
    awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value

#I was thinking about replicate the numeric value for the entire column and make the comparison "greater than", but the repetition times depend on the ASV counts for each sample, and they are always different.

    wc -l ${i}_ASV > n
    for (( c=1; c<=n; c++)) ; do echo ${i}_percentage-value ; done

    paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_percentage-value > ${i}_tmp; 
    awk 'BEGIN{OFS="\t"}{if($2 >= $3) print $1}' ${i}_tmp > ${i}_is30;

#How the output should be:

    paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_is30 > ${i}_summary_nh
    echo -e "ASV_ID\tASV_in_sample\ttotal_ASVs_inSample\ttreshold_for_30%\tASV_over30%" | cat - ${i}_summary_nh > ${i}_summary
    rm ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_ASV ${i}_summary_nh ${i}_is30
    done &

Solution

  • Would you please try the following:

    awk -v OFS="\t" '
        NR==FNR {   # this block is executed in the 1st pass only
            if (FNR > 1) sum[$1] += $3
                    # accumulate the "count" for each "SampleID"
            next
        }
                    # the following block is executed in the 2nd pass only
        FNR > 1 {   # skip the header line
            if ($1 != prev_id) {
                    # SampleID has changed. then update the output filename and print the header line
                if (outfile) close(outfile)
                    # close previous outfile
                outfile = $1 "_summary"
                print "ASV_ID", "ASV_in_sample", "total_ASVs_inSample", "treshold_for_30%", "ASV_over30%" >> outfile
                prev_id = $1
            }
            mark = ($3 > sum[$1] * 0.3) ? 1 : 0
                    # set the mark to "1" if the "Count" exceeds 30% of sum
            print $2, $3, sum[$1], sum[$1] * 0.3, mark >> outfile
                    # append the line to the summary file
        }
    ' data.csv data.csv
    

    data.csv:

    SampleID    ASV    Count
    1000A   ASV_1216    14
    1000A   ASV_12580   150
    1000A   ASV_12691   260
    1000A   ASV_135     434
    1000A   ASV_147     79
    1000A   ASV_15      287
    1000A   ASV_16      361
    1000A   ASV_184     8
    1000A   ASV_19      42
    1000B   ASV_1       90
    1000B   ASV_2       90
    1000B   ASV_3       20
    1000C   ASV_4       100
    1000C   ASV_5       10
    1000C   ASV_6       10
    

    In the following output examples, the last field ASV_over30% indicates 1 if the count exceeds 30% of the sum value.

    1000A_summary:

    ASV_ID  ASV_in_sample   total_ASVs_inSample     treshold_for_30%        ASV_over30%
    ASV_1216        14      1635    490.5   0
    ASV_12580       150     1635    490.5   0
    ASV_12691       260     1635    490.5   0
    ASV_135 434     1635    490.5   0
    ASV_147 79      1635    490.5   0
    ASV_15  287     1635    490.5   0
    ASV_16  361     1635    490.5   0
    ASV_184 8       1635    490.5   0
    ASV_19  42      1635    490.5   0
    

    1000B_summary:

    ASV_ID  ASV_in_sample   total_ASVs_inSample     treshold_for_30%        ASV_over30%
    ASV_1   90      200     60      1
    ASV_2   90      200     60      1
    ASV_3   20      200     60      0
    

    1000C_summary:

    ASV_ID  ASV_in_sample   total_ASVs_inSample     treshold_for_30%        ASV_over30%
    ASV_4   100     120     36      1
    ASV_5   10      120     36      0
    ASV_6   10      120     36      0
    

    [Explanations]

    When calculating the average of the input data, we need to go through until the end of the data. If we want to print out the input record and the average value (or other information based on the average) at the same time, we need to use a trick:

    • To store the whole input records in memory.
    • To read the input data twice.

    As awk is suitable for reading multiple files changing the proceduce depending the order of files, I have picked the 2nd method.

    • The condition NR==FNR returns TRUE while reading the 1st file only. We calculate the sum of count field within this block as a 1st pass.
    • The next statement at the end of the block skips the following codes.
    • If the 1st file is done, the script reads the 2nd file which is same as the 1st file, of course.
    • While reading the 2nd file, the condition NR==FNR no longer returns TRUE and the 1st block is skipped.
    • The 2nd block reads the input file again, opening a file to print the output, reading the input data line by line, and adding information such as average value obtained in the 1st pass.