Search code examples
awkaverage

awk compute average, but only for specific values in certain columns


I have a csv file with 6 columns and 1M rows. The sample file is like this:

19;19;2021-06-01;0;France;10.3991
19;19;2021-06-01;0;Germany;0
19;19;2021-06-01;0;Others;8.08681
19;19;2021-06-01;0;Portugal;2486.39
19;19;2021-06-01;0;Spain;7.70791
19;19;2021-06-01;2;Switzerland;5.3688
19;19;2021-06-01;2;United Kingdom;0.00256085
19;19;2021-06-01;2;France;13.4166
19;19;2021-06-01;4;Germany;0

All the columns can have different values (for example, the first and second columns can have values such as "19", "20", "21", .. and so on. The third column goes from 2021-06-01 to 2021-06-29. Fourth column can have "2", "4", "6", ...up to "22". Fifth column can have different nationalities).

I need to compute the average of the 6th column based on the values from the days "2021-06-02", "2021-06-09", "2021-06-16" and "2021-06-16", and assign it to the day "2021-06-30".

But some values in some columns are not present in all the days.

In other words, I need to check if the values of the columns 1,2,4,5 are the same for these different days, then compute the average of column 6 for this four days.

For example:

19;19;2021-06-02;0;France;Value1
19;19;2021-06-09;0;France;Value2
19;19;2021-06-16;0;France;Value3
19;19;2021-06-23;0;France;Value4

Then, Averagevalue=(Value1+Value2+Value3+Value4)/4

With this, I can assign the average value as:

19;19;2021-06-30;0;France;Averagevalue

Next,

19;19;2021-06-02;2;France;Value1
19;19;2021-06-09;2;France;Value2
19;19;2021-06-23;2;France;Value3

Then, Averagevalue=(Value1+Value2+Value3)/3

With this, I can assign the average value as:

19;19;2021-06-30;2;France;Averagevalue

And repeat this procedure for all the possible values of the columns 1,2,4 and 5.

I tried this:

awk -F";" '{if($3=="2021-06-02"||$3=="2021-06-09"||$3=="2021-06-16"||$3=="2021-06-23") seen[$1";"$2";"$4";"$5]+=$6 count[$1";"$2";"$4";"$5]++} END { for (i in seen) print i, seen[i]/count[i] }' input.csv 

without success. Any ideas?


Solution

  • I finally solved my problem with two passes of awk:

    awk -F";" '{if($3=="2021-06-02"||$3=="2021-06-09"||$3=="2021-06-16"||$3=="2021-06-23") print $1";"$2";"$4";"$5";"$6}' june2021-no30.csv > 4dnd
    
    awk -F";" '{s[$1";"$2";"$3";"$4]+=$5 c[$1";"$2";"$3";"$4]++ } END { for (i in s) print i";"s[i]/c[i]";20210630" }' 4dnd > 4dave
    

    Then, with sort and some reformatting I got the file I needed