awk equivalents for tidyverse concepts (melt and spread)

I have some text logs that I need to parse and format into CSV. I have a working R script but it is slow once file sizes increase and this problem seems like a good candidate for a speed up using awk (or other commandline tools?) as I understand.

I have not done much with awk, and the issue I am having is translating how I think about processing in R to how awk scripting is done.

Example truncated input data (Scrap.log):

; these are comment lines
; *******************************************************************************
; \\C:\Users\Computer\Folder\Folder\Scrap.log

!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2 
_START Header1 Header2 Header3 Header4 Header5 Header6 Header7 
10 12.23 1.91 6.63 1.68 50.03 0.50 13.97
11 11.32 1.94 6.64 1.94 50.12 0.58 15.10
12 12.96 2.15 6.57 2.12 55.60 0.62 16.24
13 11.43 2.18 6.60 2.36 50.89 0.68 17.39
14 14.91 2.32 6.64 2.59 56.09 0.73 18.41
15 13.16 2.38 6.53 2.85 51.62 0.81 19.30
16 15.02 2.50 6.67 3.05 56.22 0.85 20.12

!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2 
_START Header8 Header9 Header10 Header11 Header12 Header13 Header14
10 22.03 24.41 15.01 51.44 44.28 16.57 11.52
11 21.05 24.62 15.62 51.23 45.42 16.47 11.98
12 20.11 24.64 16.38 52.16 46.59 16.54 12.42
13 24.13 24.93 17.23 52.34 47.72 16.51 12.88
14 27.17 24.95 18.06 52.79 48.72 16.45 13.30
15 22.87 25.04 19.27 53.01 49.50 16.47 13.63
16 23.08 25.22 20.12 53.75 50.64 16.55 14.03

Expected output (truncated):

HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header1,12.23
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header2,1.91
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header3,6.63
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header4,1.68
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header5,50.03
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header6,0.5
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header7,13.97
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header1,11.32
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header2,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header3,6.64
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header4,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header5,50.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header6,0.58
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header7,15.1
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header1,12.96
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header2,2.15
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header3,6.57
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header4,2.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header5,55.6
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header6,0.62
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header7,16.24
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header1,11.43
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header2,2.18
...

My general steps in the R script:

add a single header row with new names at the top of file
spread the top row (starting with !!G) to each row
melt the header column (_START) from wide to long format

Pieces I have working in `awk` so far include:

how to grab and print the header lines

awk '/_START/ {header = $0; print header}' Scrap.log

How to write a single row with the new header values

awk ' BEGIN{ ORS=" "; for (counter = 1; counter <= 14; counter++) print "HH",counter;}'

I know each block is separated by a newline and starts with a !!G, so can write a match on that. Unsure if a split-apply-combine type of thinking works well in awk?

awk '/!!G/,/\n/ {print}' Scrap.log

alternatively, I tried setting RS/FS parameters like:

awk ' BEGIN{RS="\n";FS=" ";}/^!!G/{header=$0;print header}/[0-9]/{print $2}END{}' Scrap.log

I then get stuck on iterating over the rows and fields to do the melt step as well as combining the capture groups correctly.

How do I combine all these pieces to get to the CSV format?

Solution

I think the following:

awk '
BEGIN{
    # output the header line
    print "HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value"
}
# ignore comment lines
/;/{next}

/!!G/{
    valcnt = 1
    # save and shuffle the values
    val[valcnt++] = $2
    val[valcnt++] = $11
    val[valcnt++] = $12
    val[valcnt++] = $13
    val[valcnt++] = $14
    val[valcnt++] = $15
    val[valcnt++] = $3
    val[valcnt++] = $4
    val[valcnt++] = $5
    val[valcnt++] = $6
    val[valcnt++] = $7
    val[valcnt++] = $8
    val[valcnt++] = $9
    val[valcnt++] = $10
    next
}
/_START /{
    # these are headers - save them to head, to be reused later
    for (i = 2; i <= NF; ++i) {
        # fun fact: its indexed on NF
        head[i] = $i
    }
    next
}

# this function is redundant, but its just easier for me to think about the code
function output(firstval, header, value, \
        cur, i) {
    cur = valcnt
    val[cur++] = firstval
    val[cur++] = header
    val[cur++] = value
    # output val as csv
    for (i = 1; i < cur; ++i) {
        printf "%s%s", val[i], i != cur - 1 ? "," : "\n"
    }
}

/[0-9]+/{
    for (i = 2; i <= NF; ++i) {
        # add these 3 to all the other values and output them
        # ie. add first column, the header from header and the value
        output($1, head[i], $i)
    }
}

'

Should output what you want. Tested on repl.

awk equivalents for tidyverse concepts (melt and spread)

Example truncated input data (Scrap.log):

Expected output (truncated):

Pieces I have working in awk so far include:

Pieces I have working in `awk` so far include: