I have a file (example.csv) with around 5M rows, each row containing 4 columns (user, day, type, value), like this:
user1,2022-01-01,type1,0.1
user1,2022-01-01,type1,0.9
user1,2022-01-02,type1,1.0
user1,2022-01-02,type2,1.0
user2,2022-01-01,type1,1.0
user2,2022-01-01,type2,1.0
user3,2022-01-01,type1,0.3
user3,2022-01-01,type1,0.2
user3,2022-01-01,type1,0.5
I would like to sum the values (4th column in this example) that correspond to the same user, day and type, so the expected output should look like this:
user1,2022-01-01,type1,1.0
user1,2022-01-02,type1,1.0
user1,2022-01-02,type2,1.0
user2,2022-01-01,type1,1.0
user2,2022-01-01,type2,1.0
user3,2022-01-01,type1,1.0
I tried something like this to see if it would work:
awk -F"," '!seen[$1]++;&&!seen[$2]++;&&!seen[$3]++;sum+=$4{print sum}' example.csv
but I am still far from the correct solution. Any suggestions?
$ awk '
BEGIN {
    FS = OFS = ","
}
{
    a[$1 OFS $2 OFS $3] += $4
}
END {
    for (i in a)
        print i, sprintf("%.1f", a[i])
}' file
Output:
user2,2022-01-01,type1,1.0
user2,2022-01-01,type2,1.0
user1,2022-01-01,type1,1.0
user3,2022-01-01,type1,1.0
user1,2022-01-02,type1,1.0
user1,2022-01-02,type2,1.0
The output order is awk-implementation dependent. If you need a deterministic order, pipe the output through sort, or use GNU awk's PROCINFO["sorted_in"].
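As a sketch of the sort-based option, here is the same awk program piped through POSIX sort to get stable, key-ordered output (filenames example.csv and summed.csv are assumptions for illustration):

```shell
# Recreate the sample input from the question (assumed filename: example.csv)
cat > example.csv <<'EOF'
user1,2022-01-01,type1,0.1
user1,2022-01-01,type1,0.9
user1,2022-01-02,type1,1.0
user1,2022-01-02,type2,1.0
user2,2022-01-01,type1,1.0
user2,2022-01-01,type2,1.0
user3,2022-01-01,type1,0.3
user3,2022-01-01,type1,0.2
user3,2022-01-01,type1,0.5
EOF

# Sum column 4 per (user,day,type) key, then sort for deterministic order.
# sort -t, sets comma as the field separator; -k1,1 -k2,2 -k3,3 sorts by
# user, then day, then type, each as its own key.
awk '
BEGIN { FS = OFS = "," }
{ a[$1 OFS $2 OFS $3] += $4 }
END {
    for (i in a)
        print i, sprintf("%.1f", a[i])
}' example.csv | sort -t, -k1,1 -k2,2 -k3,3 > summed.csv

cat summed.csv
```

This keeps the awk program portable (no gawk extensions) and pushes the ordering concern onto sort, which is well-defined across platforms.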