Search code examples
shellawkgrepsummary

Linux summary a table and sum all columns from each given occurrence


After finishing an analysis I get a table with a lot of columns and rows. Also, as a new table is written, the number of lines/cols can vary so I am not able to predict how many of each there will be. Every row has an index in column one but those indexes can repeat through the table. So what I want is a grep/awk/bash way to retrieve all those lines with the same index and sum all columns to get just one line with the summed values. As an illustration:

table

index,sampleA,sampleB,sampleC
nana,22,12,4
baba,47,4,5
nana,1,5,9
nana,7,5,8

after parsing

index,sampleA,sampleB,sampleC
nana,30,22,21
baba,47,4,5

I would appreciate so much if you could help me on that. Many thanks.


Solution

  • A little long winded, but something like this will do the job:

    awk -F"," 'BEGIN{OFS=FS} NR==1{print $0; next} NR>1{sampleA[$1]+=$2; sampleB[$1]+=$3; sampleC[$1]+=$4}END{for (sample in sampleA){print sample, sampleA[sample], sampleB[sample], sampleC[sample]}}' yourfile
    

    Explanation:

    1. Split each line by a comma -F","
    2. Before processing the file insure that the output field separator matches the field seperator (a comma) BEGIN{OFS=FS}
    3. If this is the first line of the file then print it out (it's the header) and then continue to the next line NR==1{print $0; next}
    4. If this isn't the header line then create three arrays to store the cumulative values from columns 2, 3, 4. The "key" of the arrays is the value in column 1 NR>1{sampleA[$1]+=$2; sampleB[$1]+=$3; sampleC[$1]+=$4}
    5. Finally at the end, loop through the first of the three arrays (they will all be the same length so we only need to loop one). Then print out the values stored in each one for that key: END{for (sample in sampleA){print sample, sampleA[sample], sampleB[sample], sampleC[sample]}}