
Making AWK code more efficient when evaluating sets of records


I have a file with 5 fields per record, and I am evaluating it 4 lines at a time: records 1-4 are one set, records 5-8 are the next set, and so on. Within each set, I want to extract the time from field 5 of the record where field 4 has its maximum value. If the maximum value in field 4 occurs more than once, I want to find the maximum value in field 2 and use the time in field 5 from that record.

For example, in the first 4 records, the maximum value in field 4 (53) appears twice. In that case, I need to look at field 2, find its maximum value, and print the time in field 5 from that record.

The Data Set is:

 00        31444      8.7        24    00:04:32
 00        44574     12.4        25    00:01:41
 00        74984     20.8        53    00:02:22
 00        84465     23.5        53    00:12:33
 01        34748      9.7        38    01:59:28
 01        44471     12.4        37    01:55:29
 01        74280     20.6        58    01:10:24
 01        80673     22.4        53    01:55:49

The desired output for records 1 through 4 is 00:12:33. The desired output for records 5 through 8 is 01:10:24.

Here is my answer:

Evaluate Records 1 through 4

awk 'NR==1,NR==4 {if(max <= $4) {max = $4; time = $5} else if(max == $4) {max = $2; time = $5};next}END {print time}' test.txt test.txt

Output is: 00:12:33

Evaluate Records 5 through 8

awk 'NR==5,NR==8 {if(max <= $4) {max = $4; time = $5} else if(max == $4) {max = $2; time = $5};next}END {print time}' test.txt test.txt

Output is: 01:10:24

Any suggestions on how to evaluate the record ranges more efficiently without having to write an awk statement for each set of records?

Thanks


Solution

  • Based on your sample input, the fact that there are 4 lines for each key (first field) seems to be irrelevant; what you really want is one line of output per key. So consider sorting the input by your comparison fields (field 4, then field 2, both descending) within each key, then printing the desired output (field 5) from the first line seen for each key (field 1):

    $ sort -k1,1n -k4,4nr -k2,2nr file | awk '!seen[$1]++{print $5}'
    00:12:33
    01:10:24
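If the grouping really must be by blocks of 4 records rather than by the key in field 1, a single awk pass can do it without sorting. The following is only a sketch under some assumptions: the block size is passed in as a variable named `n` (a name chosen here for illustration), a tie in field 4 is broken by comparing field 2 among the tied records, and the sample data is written to `test.txt` as in the question.

```shell
# Recreate the sample data from the question.
cat > test.txt <<'EOF'
 00        31444      8.7        24    00:04:32
 00        44574     12.4        25    00:01:41
 00        74984     20.8        53    00:02:22
 00        84465     23.5        53    00:12:33
 01        34748      9.7        38    01:59:28
 01        44471     12.4        37    01:55:29
 01        74280     20.6        58    01:10:24
 01        80673     22.4        53    01:55:49
EOF

# For each block of n records, keep the time ($5) from the record with the
# largest $4; on a tie in $4, prefer the record with the larger $2.
result=$(awk -v n=4 '
    # First record of a block: it is the best candidate so far.
    (NR - 1) % n == 0 { max4 = $4 + 0; max2 = $2 + 0; time = $5 }
    # Later records: take over if $4 is larger, or equal with a larger $2.
    (NR - 1) % n != 0 && ($4 + 0 > max4 || ($4 + 0 == max4 && $2 + 0 > max2)) { max4 = $4 + 0; max2 = $2 + 0; time = $5 }
    # Last record of a block: print the winning time.
    NR % n == 0 { print time }
    # A trailing partial block (fewer than n records) is still reported.
    END { if (NR % n != 0) print time }
' test.txt)

printf '%s\n' "$result"
```

With the sample data this prints 00:12:33 and 01:10:24, one per line. Changing `n` on the command line handles other block sizes without writing a separate awk statement per range.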