
AWK - Using arrays to count by hour and unique value


I have the following input file:

Unit1 15 00:20:58
Unit1 30 01:10:00
Unit3 10 00:20:15
Unit2 5  00:45:00
Unit3 20 00:30:00
Unit2 2  01:22:35
Unit2 3  01:35:22
Unit1 5  00:58:20

Some background on this input file: it is a list of work units for an e-portal that I have been tasked with analyzing. Each log line gives the unit name ($1), the total number of questions a student completed before hitting submit ($2), and the time at which the submission was recorded ($3). The data has been tweaked to make the example clearer.

I would like to output the following:

Unit1
---------------------
00
========
20
--------
01 
========
30
--------

Unit2
---------------------
00
========
5
--------
01 
========
5
--------

Unit3
---------------------
00
========
30
--------

The code I currently have is as follows:

#!/usr/bin/gawk -f

{ #Start of MID
        key = $1 #Message Extracted 10 Total
        key2 = substr($3,1,2) #Hour
        MSG_TYPE[key]++ #Distinct Message
        HOUR_AR[key2]++
        HT_AR[key2] += $2 #Tots up the total for each message by hour

} #End of MID
END {
                for (MSG in MSG_TYPE) {
                        print MSG
                        print "-----------------------------------"
                n=asorti(HOUR_AR, HOUR_SOR)
                for (i = 1; i <= n; i++) {
                            print HOUR_SOR[i]
                            print "========="
                            print HOUR_AR[HOUR_SOR[i]]
                            print "---------"
                            }
                            print "\n"
                    }
    } #End of END

The logic behind this code is that MSG_TYPE[] collects all the unique values of $1; a for loop then iterates over them and prints each one. HOUR_AR[] collects the hours and is sorted. On each pass of the MSG loop, the inner loop should (hopefully) return all the hours for that particular MSG and print the sum of $2 for that hour AND MSG.

I am sorry this is long-winded; I just wanted to provide enough detail. Any and all help is greatly appreciated.


Solution

  • For the given example, this code produces the output you expect:

     awk -F'[ :]+' '{u[$1][$3]+=$2}
         END{for(i in u){
                print i;print "--------";
                for(j in u[i])
                   print j"\n====\n"u[i][j]"\n---"}}' file
    

    It outputs:

    Unit1
    --------
    00
    ====
    20
    ---
    01
    ====
    30
    ---
    Unit2
    --------
    00
    ====
    5
    ---
    01
    ====
    5
    ---
    Unit3
    --------
    00
    ====
    30
    ---
    

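    Note that the sorting part is not done in this code; the order in which for (... in ...) visits array elements is unspecified. But you get the idea: the implementation becomes easier if you use GNU awk's arrays of arrays (available in gawk 4.0 and later), because the totals are keyed by both unit and hour. In your script, HOUR_AR and HT_AR are keyed by hour only, so every unit shares the same hour entries.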

    https://www.gnu.org/software/gawk/manual/html_node/Arrays-of-Arrays.html#Arrays-of-Arrays