Search code examples
awktext-processing

awk equivalents for tidyverse concepts (melt and spread)


I have some text logs that I need to parse and format into CSV. I have a working R script but it is slow once file sizes increase and this problem seems like a good candidate for a speed up using awk (or other commandline tools?) as I understand.

I have not done much with awk, and the issue I am having is translating how I think about processing in R to how awk scripting is done.

Example truncated input data (Scrap.log):

; these are comment lines
; *******************************************************************************
; \\C:\Users\Computer\Folder\Folder\Scrap.log

!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2 
_START Header1 Header2 Header3 Header4 Header5 Header6 Header7 
10 12.23 1.91 6.63 1.68 50.03 0.50 13.97
11 11.32 1.94 6.64 1.94 50.12 0.58 15.10
12 12.96 2.15 6.57 2.12 55.60 0.62 16.24
13 11.43 2.18 6.60 2.36 50.89 0.68 17.39
14 14.91 2.32 6.64 2.59 56.09 0.73 18.41
15 13.16 2.38 6.53 2.85 51.62 0.81 19.30
16 15.02 2.50 6.67 3.05 56.22 0.85 20.12

!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2 
_START Header8 Header9 Header10 Header11 Header12 Header13 Header14
10 22.03 24.41 15.01 51.44 44.28 16.57 11.52
11 21.05 24.62 15.62 51.23 45.42 16.47 11.98
12 20.11 24.64 16.38 52.16 46.59 16.54 12.42
13 24.13 24.93 17.23 52.34 47.72 16.51 12.88
14 27.17 24.95 18.06 52.79 48.72 16.45 13.30
15 22.87 25.04 19.27 53.01 49.50 16.47 13.63
16 23.08 25.22 20.12 53.75 50.64 16.55 14.03

Expected output (truncated):

HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header1,12.23
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header2,1.91
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header3,6.63
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header4,1.68
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header5,50.03
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header6,0.5
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header7,13.97
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header1,11.32
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header2,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header3,6.64
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header4,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header5,50.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header6,0.58
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header7,15.1
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header1,12.96
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header2,2.15
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header3,6.57
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header4,2.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header5,55.6
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header6,0.62
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header7,16.24
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header1,11.43
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header2,2.18
...

My general steps in the R script:

  1. add a single header row with new names at the top of file
  2. spread the top row (starting with !!G) to each row
  3. melt the header column (_START) from wide to long format

Pieces I have working in awk so far include:

  1. how to grab and print the header lines

awk '/_START/ {header = $0; print header}' Scrap.log

  1. How to write a single row with the new header values

awk ' BEGIN{ ORS=" "; for (counter = 1; counter <= 14; counter++) print "HH",counter;}'

  1. I know each block is separated by a newline and starts with a !!G, so can write a match on that. Unsure if a split-apply-combine type of thinking works well in awk?

awk '/!!G/,/\n/ {print}' Scrap.log

alternatively, I tried setting RS/FS parameters like:

awk ' BEGIN{RS="\n";FS=" ";}/^!!G/{header=$0;print header}/[0-9]/{print $2}END{}' Scrap.log

I then get stuck on iterating over the rows and fields to do the melt step as well as combining the capture groups correctly.

How do I combine all these pieces to get to the CSV format?


Solution

  • I think the following:

    awk '
    BEGIN{
        # output the header line
        print "HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value"
    }
    # ignore comment lines
    /;/{next}
    
    /!!G/{
        valcnt = 1
        # save and shuffle the values
        val[valcnt++] = $2
        val[valcnt++] = $11
        val[valcnt++] = $12
        val[valcnt++] = $13
        val[valcnt++] = $14
        val[valcnt++] = $15
        val[valcnt++] = $3
        val[valcnt++] = $4
        val[valcnt++] = $5
        val[valcnt++] = $6
        val[valcnt++] = $7
        val[valcnt++] = $8
        val[valcnt++] = $9
        val[valcnt++] = $10
        next
    }
    /_START /{
        # these are headers - save them to head, to be reused later
        for (i = 2; i <= NF; ++i) {
            # fun fact: its indexed on NF
            head[i] = $i
        }
        next
    }
    
    # this function is redundant, but its just easier for me to think about the code
    function output(firstval, header, value, \
            cur, i) {
        cur = valcnt
        val[cur++] = firstval
        val[cur++] = header
        val[cur++] = value
        # output val as csv
        for (i = 1; i < cur; ++i) {
            printf "%s%s", val[i], i != cur - 1 ? "," : "\n"
        }
    }
    
    /[0-9]+/{
        for (i = 2; i <= NF; ++i) {
            # add these 3 to all the other values and output them
            # ie. add first column, the header from header and the value
            output($1, head[i], $i)
        }
    }
    
    '
    

    Should output what you want. Tested on repl.