I have a csv file with 6 columns and 1M rows. The sample file is like this:
19;19;2021-06-01;0;France;10.3991
19;19;2021-06-01;0;Germany;0
19;19;2021-06-01;0;Others;8.08681
19;19;2021-06-01;0;Portugal;2486.39
19;19;2021-06-01;0;Spain;7.70791
19;19;2021-06-01;2;Switzerland;5.3688
19;19;2021-06-01;2;United Kingdom;0.00256085
19;19;2021-06-01;2;France;13.4166
19;19;2021-06-01;4;Germany;0
All the columns can have different values (for example, the first and second columns can have values such as "19", "20", "21", .. and so on. The third column goes from 2021-06-01 to 2021-06-29. Fourth column can have "2", "4", "6", ...up to "22". Fifth column can have different nationalities).
I need to compute the average of the 6th column based on the values from the days "2021-06-02", "2021-06-09", "2021-06-16" and "2021-06-16", and assign it to the day "2021-06-30".
But some values in some columns are not present in all the days.
In other words, I need to check if the values of the columns 1,2,4,5 are the same for these different days, then compute the average of column 6 for this four days.
For example:
19;19;2021-06-02;0;France;Value1
19;19;2021-06-09;0;France;Value2
19;19;2021-06-16;0;France;Value3
19;19;2021-06-23;0;France;Value4
Then, Averagevalue=(Value1+Value2+Value3+Value4)/4
With this, I can assign the average value as:
19;19;2021-06-30;0;France;Averagevalue
Next,
19;19;2021-06-02;2;France;Value1
19;19;2021-06-09;2;France;Value2
19;19;2021-06-23;2;France;Value3
Then, Averagevalue=(Value1+Value2+Value3)/3
With this, I can assign the average value as:
19;19;2021-06-30;2;France;Averagevalue
And repeat this procedure for all the possible values of the columns 1,2,4 and 5.
I tried this:
awk -F";" '{if($3=="2021-06-02"||$3=="2021-06-09"||$3=="2021-06-16"||$3=="2021-06-23") seen[$1";"$2";"$4";"$5]+=$6 count[$1";"$2";"$4";"$5]++} END { for (i in seen) print i, seen[i]/count[i] }' input.csv
without success. Any ideas?
I finally solved my problem with two passes of awk:
awk -F";" '{if($3=="2021-06-02"||$3=="2021-06-09"||$3=="2021-06-16"||$3=="2021-06-23") print $1";"$2";"$4";"$5";"$6}' june2021-no30.csv > 4dnd
awk -F";" '{s[$1";"$2";"$3";"$4]+=$5 c[$1";"$2";"$3";"$4]++ } END { for (i in s) print i";"s[i]/c[i]";20210630" }' 4dnd > 4dave
Then, with sort and some reformatting I got the file I needed