
Awk: set RS to include newline and 1st (only) field of next row // logfile "splits" based on custom RS and print matching pattern therein


The short version of the question: what value should RS be set to in awk so that records are split at each line whose n-th field is empty? (If the line were completely empty, i.e. had no Timestamp field as in my examples, then setting RS="\n\n ..." would do.)

The long version: this is what my log file looks like (notice the interleaved sections for **amd64** and **arm64**):

...
2023-12-29T16:05:20.3032116Z 
2023-12-29T16:05:20.3040485Z #10 [linux/arm64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.4084773Z #10 DONE 0.8s
2023-12-29T16:05:20.4085104Z 
2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
2023-12-29T16:05:20.5505699Z 
2023-12-29T16:05:20.5509862Z #12 [linux/amd64 builder 2/8] RUN apk add --no-cache libc6-compat
2023-12-29T16:05:20.5512029Z #12 0.138 fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
2023-12-29T16:05:20.6982466Z #12 ...
2023-12-29T16:05:20.6983744Z
2023-12-29T16:05:21.2474882Z #16 [linux/arm64 runner 2/7] RUN addgroup -S -g 1001 nodejs
2023-12-29T16:05:21.3971789Z #16 ...
2023-12-29T16:05:21.3972318Z 
...

... as can be seen, each section ends with a line that contains nothing except a Timestamp.

The goal is to print the sections (lines) for amd64 and for arm64 separately, e.g. (for amd64):

2023-12-29T16:05:20.4085104Z      <-- ideally be present in output
2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
2023-12-29T16:05:20.5505699Z       <-- ideally be present in output
2023-12-29T16:05:20.5509862Z #12 [linux/amd64 builder 2/8] RUN apk add --no-cache libc6-compat
2023-12-29T16:05:20.5512029Z #12 0.138 fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz

The ideal solution would:

  • not necessarily use awk, unless solutions in sed & co. would be real overkill and more 'script-like'
  • be relatively easy to remember and intuitive to replicate for similar recurring use-cases
  • not be too specific, i.e. work for any other first (or n-th) field (not necessarily a Timestamp-formatted one)
  • not use any extra tools besides the main one (e.g. awk)

The following solution works only partially, and only if the separator lines in the log are completely empty (e.g. have no Timestamp field): awk -vRS='\n\n' -vORS='\n\n' '/amd64 builder/ 1' logfile. As an extra question: why does this solution print the searched keyword (amd64 in my case) twice in the first section of the output, and how can that be corrected? The subsequent sections contain the keyword only once (as expected).

Thanks

LE: I just realized that, without preserving the line that contains only the Timestamp, the output is hard to read, so could you, @Ed Morton and @markp-fuso, adjust your answers a little to preserve that line? Thank you!


Solution

    $ awk -v tgt='amd64' 'NF<2{f=""; next} !f{f=($3 ~ ("/"tgt"$"))} f' file
    2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
    2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
    2023-12-29T16:05:20.5509862Z #12 [linux/amd64 builder 2/8] RUN apk add --no-cache libc6-compat
    2023-12-29T16:05:20.5512029Z #12 0.138 fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
    2023-12-29T16:05:20.6982466Z #12 ...
    

    $ awk -v tgt='arm64' 'NF<2{f=""; next} !f{f=($3 ~ ("/"tgt"$"))} f' file
    2023-12-29T16:05:20.3040485Z #10 [linux/arm64 builder 1/8] WORKDIR /app
    2023-12-29T16:05:20.4084773Z #10 DONE 0.8s
    2023-12-29T16:05:21.2474882Z #16 [linux/arm64 runner 2/7] RUN addgroup -S -g 1001 nodejs
    2023-12-29T16:05:21.3971789Z #16 ...
    
    • NF<2{f=""; next} clears the flag f when there's only a timestamp on the line.
    • !f{f=($3 ~ ("/"tgt"$"))} runs only while f is not yet set: it sets f to 1 if the 3rd field ends with "/" followed by tgt, as in #11 [linux/amd64 builder 1/8], or to 0 otherwise.
    • f causes the current line to be printed when f is 1.
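
The same command may be easier to remember when spread over several lines with comments. This is only a sketch: the temporary file and the shortened sample log are made up for illustration, and the awk logic is unchanged from the one-liner above.

```shell
# Hypothetical sample data, shortened from the log in the question.
log=$(mktemp)
cat > "$log" <<'EOF'
2023-12-29T16:05:20.3032116Z
2023-12-29T16:05:20.3040485Z #10 [linux/arm64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.4084773Z #10 DONE 0.8s
2023-12-29T16:05:20.4085104Z
2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
2023-12-29T16:05:20.5505699Z
EOF

out=$(awk -v tgt='amd64' '
    NF < 2 { f = ""; next }              # timestamp-only line: section ended, clear the flag
    !f     { f = ($3 ~ ("/" tgt "$")) }  # flag not set: does field 3 end with "/" tgt?
    f                                    # flag set: print the current line
' "$log")
printf '%s\n' "$out"
rm -f "$log"
```

Changing tgt to 'arm64' selects the other platform; changing $3 to another field number adapts the test to a different log layout.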

    Setting RS to \n\n cannot work for this input: it can split records only where two newlines are adjacent, i.e. at completely empty lines, and your separator lines still contain a timestamp.
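
To check whether a log contains any completely empty lines for RS='\n\n' to split on, one option is simply to count them. A minimal sketch, assuming a made-up four-line sample in a temporary file; the variable names are illustrative only.

```shell
# Hypothetical sample: every separator line still carries a timestamp.
log=$(mktemp)
cat > "$log" <<'EOF'
2023-12-29T16:05:20.4085104Z
2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
2023-12-29T16:05:20.5505699Z
EOF

# Count lines with no fields at all (NF == 0) and lines with exactly one field.
counts=$(awk 'NF == 0 { empty++ } NF == 1 { single++ } END { print empty + 0, single + 0 }' "$log")
echo "$counts"   # first number: empty lines, second: timestamp-only lines
rm -f "$log"
```

A first number of 0 means there is nothing for an empty-line separator to match, so the whole file would be read as one record.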

    Given your comments, it sounds like this is what you're looking for (using GNU awk for multi-char RS, RT, and \S/\s):

    $ awk -v RS='\n\\S+\\s*\n' -v ORS= '/amd64/{print $0 RT}' file
    2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
    2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
    2023-12-29T16:05:20.5505699Z
    2023-12-29T16:05:20.5509862Z #12 [linux/amd64 builder 2/8] RUN apk add --no-cache libc6-compat
    2023-12-29T16:05:20.5512029Z #12 0.138 fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
    2023-12-29T16:05:20.6982466Z #12 ...
    2023-12-29T16:05:20.6983744Z
    

    $ awk -v RS='\n\\S+\\s*\n' -v ORS= '/arm64/{print $0 RT}' file
    2023-12-29T16:05:20.3032116Z
    2023-12-29T16:05:20.3040485Z #10 [linux/arm64 builder 1/8] WORKDIR /app
    2023-12-29T16:05:20.4084773Z #10 DONE 0.8s
    2023-12-29T16:05:20.4085104Z
    2023-12-29T16:05:21.2474882Z #16 [linux/arm64 runner 2/7] RUN addgroup -S -g 1001 nodejs
    2023-12-29T16:05:21.3971789Z #16 ...
    2023-12-29T16:05:21.3972318Z
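
If GNU awk is not available, a small change to the flag-based script also keeps the timestamp-only line that closes each matching section. This is a sketch rather than part of the original answer: the helper name print_sections and the shortened sample log are assumptions for illustration. Unlike the RS/RT version, it does not print a timestamp-only line that precedes the very first section.

```shell
# print_sections TARGET FILE: hypothetical helper name, used only for this sketch.
print_sections() {
    awk -v tgt="$1" '
        NF < 2 { if (f) print; f = 0; next }  # separator: print it only if it closes a matching section
        !f     { f = ($3 ~ ("/" tgt "$")) }   # flag not set: does field 3 end with "/" tgt?
        f                                     # inside a matching section: print the line
    ' "$2"
}

# Hypothetical sample data, shortened from the log in the question.
log=$(mktemp)
cat > "$log" <<'EOF'
2023-12-29T16:05:20.3032116Z
2023-12-29T16:05:20.3040485Z #10 [linux/arm64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.4084773Z #10 DONE 0.8s
2023-12-29T16:05:20.4085104Z
2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
2023-12-29T16:05:20.5505699Z
EOF

out_amd64=$(print_sections amd64 "$log")
out_arm64=$(print_sections arm64 "$log")
printf '%s\n' "$out_amd64"
rm -f "$log"
```

The only difference from the first script is the `if (f) print` in the NF < 2 rule, which runs before the flag is cleared.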