Search code examples
awksum

awk sum values of a column, if other columns remain constant


I have a file (example.csv) with around 5M rows, each row containing 4 columns (user, day, type, value), like this:

user1,2022-01-01,type1,0.1
user1,2022-01-01,type1,0.9
user1,2022-01-02,type1,1.0
user1,2022-01-02,type2,1.0
user2,2022-01-01,type1,1.0
user2,2022-01-01,type2,1.0
user3,2022-01-01,type1,0.3
user3,2022-01-01,type1,0.2
user3,2022-01-01,type1,0.5

I would like to sum the values (4th column in this example) that correspond to the same user, day and type, so the expected output should look like this:

user1,2022-01-01,type1,1.0
user1,2022-01-02,type1,1.0
user1,2022-01-02,type2,1.0
user2,2022-01-01,type1,1.0
user2,2022-01-01,type2,1.0
user3,2022-01-01,type1,1.0

I tried something like this to try if it works

awk -F"," '!seen[$1]++;&&!seen[$2]++;&&!seen[$3]++;sum+=$4{print sum}' example.csv

but I am still far from the correct solution. Any suggestions?


Solution

  • $ awk '
    BEGIN {
        FS=OFS=","
    }
    {
        a[$1 OFS $2 OFS $3]+=$4
    }
    END {
        for(i in a) 
            print i,sprintf("%.1f",a[i])
    }' file
    

    Output:

    user2,2022-01-01,type1,1.0
    user2,2022-01-01,type2,1.0
    user1,2022-01-01,type1,1.0
    user3,2022-01-01,type1,1.0
    user1,2022-01-02,type1,1.0
    user1,2022-01-02,type2,1.0
    

    The output order is awk implementation dependent. If needed, use sort or GNU AWK's PROCINFO["sorted_in"].