Search code examples
linuxshellawksh

rearranging with same timestamp using awk


I have a couple of files.

cat << EOF > file1.txt

 20190130150000 #timestamp-1
     56   39
  P                                         A
  S-4 B                               4.     E     F      3
 -9999 -9999      6.99       0.000 -9999 -9999
 -9999 -9999      9.99       0.000 -9999 -9999
 20190130151000 #timestamp-2
     56   39
  P                                         A
  S-4 B                               4.     E     F      3
 -9999 -9999     6.99       0.000 -9999 -9999
 -9999 -9999     9.99       0.000 -9999 -9999
 20190130152000 #timestamp-3
     56   39
  P                                         A
  S-4 B                               4.     E     F      3
 -9999 -9999     6.99       0.000 -9999 -9999
 -9999 -9999     9.99       0.000 -9999 -9999

EOF

cat << EOF > file2.txt

 20190130150000 #timestamp-1
    33   55.3
  R                                         A
  S-4 B                                3.     E     F      3
 -9999 -9999     5.99       0.000 -9999 -9999
 -9999 -9999     7.99       0.000 -9999 -9999
 20190130151000 #timestamp-2
    33   55.3
  R                                         A
  S-4 B                                3.     E     F      3
 -9999 -9999     5.99       0.000 -9999 -9999
 -9999 -9999     7.99       0.000 -9999 -9999
 20190130152000 #timestamp-3
    33   55.3
  R                                         A
  S-4 B                                3.     E     F      3
 -9999 -9999     5.99       0.000 -9999 -9999
 -9999 -9999     7.99       0.000 -9999 -9999

EOF

I am looking for a awk script to rearrange it in the the following way:

Write all the same timestamps together with its subsequent rows:

Please note: There is no such #timestamp words in the files, I just mentioned it to make understandble.

So the desire output will be like:

cat << EOF > out.txt

 20190130150000
     56   39
  P                                         A
  S-4 B                               4.     E     F      3
 -9999 -9999     6.99       0.000 -9999 -9999
 -9999 -9999     9.99       0.000 -9999 -9999
 20190130150000
    33   55.3
  R                                         A
  S-4 B                                3.     E     F      3
 -9999 -9999     5.99       0.000 -9999 -9999
 -9999 -9999     7.99       0.000 -9999 -9999
 20190130151000
     56   39
  P                                         A
  S-4 B                               4.     E     F      3
 -9999 -9999      6.99      0.000 -9999 -9999
 -9999 -9999     9.99       0.000 -9999 -9999
 20190130151000
    33   55.3
  R                                         A
  S-4 B                                3.     E     F      3
 -9999 -9999     5.99       0.000 -9999 -9999
 -9999 -9999     7.99       0.000 -9999 -9999
 20190130152000
     56   39
  P                                         A
  S-4 B                               4.     E     F      3
 -9999 -9999     6.99       0.000 -9999 -9999
 -9999 -9999     9.99       0.000 -9999 -9999
 20190130152000
    33   55.3
  R                                         A
  S-4 B                                3.     E     F      3
 -9999 -9999      5.99      0.000 -9999 -9999
 -9999 -9999     7.99       0.000 -9999 -9999

EOF

My script:

awk '
{
    if ($1 ~ /^[0-9]{14}$/) {
        timestamp = $1
        print timestamp
        next
    }
    print
}' file1.txt file2.txt

This script is just printing the same inputs. It is not printing desire output.


Solution

  • Using the Decorate-Sort-Undecorate idiom with any awk+sort+cut for any number of input files of practically any size and containing any order and/or values of timestamps (so it'll work for 1 input file or 100000 input files where some have timestamp 1 before timestamp 2, some have timestamps that don't exist in any other file, massive input files, etc.)

    $ cat tst.sh
    #!/usr/bin/env bash
    
    awk -v OFS='\t' '$1 ~ /^[0-9]{14}$/{t=$1} {print t, NR, $0}' "${@:--}" |
    sort -k1,1n -k2,2n |
    cut -f3-
    

    $ ./tst.sh file{1,2}.txt
     20190130150000 #timestamp-1
         56   39
      P                                         A
      S-4 B                               4.     E     F      3
     -9999 -9999      6.99       0.000 -9999 -9999
     -9999 -9999      9.99       0.000 -9999 -9999
     20190130150000 #timestamp-1
        33   55.3
      R                                         A
      S-4 B                                3.     E     F      3
     -9999 -9999     5.99       0.000 -9999 -9999
     -9999 -9999     7.99       0.000 -9999 -9999
     20190130151000 #timestamp-2
         56   39
      P                                         A
      S-4 B                               4.     E     F      3
     -9999 -9999     6.99       0.000 -9999 -9999
     -9999 -9999     9.99       0.000 -9999 -9999
     20190130151000 #timestamp-2
        33   55.3
      R                                         A
      S-4 B                                3.     E     F      3
     -9999 -9999     5.99       0.000 -9999 -9999
     -9999 -9999     7.99       0.000 -9999 -9999
     20190130152000 #timestamp-3
         56   39
      P                                         A
      S-4 B                               4.     E     F      3
     -9999 -9999     6.99       0.000 -9999 -9999
     -9999 -9999     9.99       0.000 -9999 -9999
     20190130152000 #timestamp-3
        33   55.3
      R                                         A
      S-4 B                                3.     E     F      3
     -9999 -9999     5.99       0.000 -9999 -9999
     -9999 -9999     7.99       0.000 -9999 -9999
    

    That's doing the following:

    1. Decorate - awk -v OFS='\t' '...' adds the most recent timestamp read from the input followed by the current line number in front of each input line so we can in step 2 sort on those to get the desired output order.
    2. Sort - sort -k1,1n -k2,2n sorts by the timestamp and then by line number added in step 1 to group all lines by timestamp while retaining their original relative input order.
    3. Undecorate - cut -f3- removes the leading timestamp and line number added by awk in step 1 to get back to just the original content of each input line.

    As well as the conciseness and flexibility of the above approach, one additional benefit over trying to do it all in awk or some other tool is that we're only storing 1 line at a time in memory using awk so it'll work even for massive input files since only sort has to handle all of the input "in memory" but it's designed to be able to do that by using demand paging, etc. while awk is not and in fact awk would slow down by orders of magnitude as it's array sizes increase if trying to store huge amounts of input.

    If you're still not sure what the above is doing, just run the awk script on it's own to see what it does (adds 2 fields to the front of the input):

    $ awk -v OFS='\t' '$1 ~ /^[0-9]{14}$/{t=$1} {print t, NR, $0}' file{1,2}.txt
    20190130150000  1        20190130150000 #timestamp-1
    20190130150000  2            56   39
    20190130150000  3         P                                         A
    20190130150000  4         S-4 B                               4.     E     F      3
    20190130150000  5        -9999 -9999      6.99       0.000 -9999 -9999
    20190130150000  6        -9999 -9999      9.99       0.000 -9999 -9999
    20190130151000  7        20190130151000 #timestamp-2
    20190130151000  8            56   39
    20190130151000  9         P                                         A
    20190130151000  10        S-4 B                               4.     E     F      3
    20190130151000  11       -9999 -9999     6.99       0.000 -9999 -9999
    20190130151000  12       -9999 -9999     9.99       0.000 -9999 -9999
    20190130152000  13       20190130152000 #timestamp-3
    20190130152000  14           56   39
    20190130152000  15        P                                         A
    20190130152000  16        S-4 B                               4.     E     F      3
    20190130152000  17       -9999 -9999     6.99       0.000 -9999 -9999
    20190130152000  18       -9999 -9999     9.99       0.000 -9999 -9999
    20190130150000  19       20190130150000 #timestamp-1
    20190130150000  20          33   55.3
    20190130150000  21        R                                         A
    20190130150000  22        S-4 B                                3.     E     F      3
    20190130150000  23       -9999 -9999     5.99       0.000 -9999 -9999
    20190130150000  24       -9999 -9999     7.99       0.000 -9999 -9999
    20190130151000  25       20190130151000 #timestamp-2
    20190130151000  26          33   55.3
    20190130151000  27        R                                         A
    20190130151000  28        S-4 B                                3.     E     F      3
    20190130151000  29       -9999 -9999     5.99       0.000 -9999 -9999
    20190130151000  30       -9999 -9999     7.99       0.000 -9999 -9999
    20190130152000  31       20190130152000 #timestamp-3
    20190130152000  32          33   55.3
    20190130152000  33        R                                         A
    20190130152000  34        S-4 B                                3.     E     F      3
    20190130152000  35       -9999 -9999     5.99       0.000 -9999 -9999
    20190130152000  36       -9999 -9999     7.99       0.000 -9999 -9999
    

    Then pipe that output to sort to see what changes (lines reordered according to the numeric value of those 2 added fields):

    $ awk -v OFS='\t' '$1 ~ /^[0-9]{14}$/{t=$1} {print t, NR, $0}' file{1,2}.txt |
        sort -k1,1n -k2,2n
    20190130150000  1        20190130150000 #timestamp-1
    20190130150000  2            56   39
    20190130150000  3         P                                         A
    20190130150000  4         S-4 B                               4.     E     F      3
    20190130150000  5        -9999 -9999      6.99       0.000 -9999 -9999
    20190130150000  6        -9999 -9999      9.99       0.000 -9999 -9999
    20190130150000  19       20190130150000 #timestamp-1
    20190130150000  20          33   55.3
    20190130150000  21        R                                         A
    20190130150000  22        S-4 B                                3.     E     F      3
    20190130150000  23       -9999 -9999     5.99       0.000 -9999 -9999
    20190130150000  24       -9999 -9999     7.99       0.000 -9999 -9999
    20190130151000  7        20190130151000 #timestamp-2
    20190130151000  8            56   39
    20190130151000  9         P                                         A
    20190130151000  10        S-4 B                               4.     E     F      3
    20190130151000  11       -9999 -9999     6.99       0.000 -9999 -9999
    20190130151000  12       -9999 -9999     9.99       0.000 -9999 -9999
    20190130151000  25       20190130151000 #timestamp-2
    20190130151000  26          33   55.3
    20190130151000  27        R                                         A
    20190130151000  28        S-4 B                                3.     E     F      3
    20190130151000  29       -9999 -9999     5.99       0.000 -9999 -9999
    20190130151000  30       -9999 -9999     7.99       0.000 -9999 -9999
    20190130152000  13       20190130152000 #timestamp-3
    20190130152000  14           56   39
    20190130152000  15        P                                         A
    20190130152000  16        S-4 B                               4.     E     F      3
    20190130152000  17       -9999 -9999     6.99       0.000 -9999 -9999
    20190130152000  18       -9999 -9999     9.99       0.000 -9999 -9999
    20190130152000  31       20190130152000 #timestamp-3
    20190130152000  32          33   55.3
    20190130152000  33        R                                         A
    20190130152000  34        S-4 B                                3.     E     F      3
    20190130152000  35       -9999 -9999     5.99       0.000 -9999 -9999
    20190130152000  36       -9999 -9999     7.99       0.000 -9999 -9999
    

    Then pipe to cut to see what changes (first 2 fields removed):

    $ awk -v OFS='\t' '$1 ~ /^[0-9]{14}$/{t=$1} {print t, NR, $0}' file{1,2}.txt |
        sort -k1,1n -k2,2n |
        cut -f3-
     20190130150000 #timestamp-1
         56   39
      P                                         A
      S-4 B                               4.     E     F      3
     -9999 -9999      6.99       0.000 -9999 -9999
     -9999 -9999      9.99       0.000 -9999 -9999
     20190130150000 #timestamp-1
        33   55.3
      R                                         A
      S-4 B                                3.     E     F      3
     -9999 -9999     5.99       0.000 -9999 -9999
     -9999 -9999     7.99       0.000 -9999 -9999
     20190130151000 #timestamp-2
         56   39
      P                                         A
      S-4 B                               4.     E     F      3
     -9999 -9999     6.99       0.000 -9999 -9999
     -9999 -9999     9.99       0.000 -9999 -9999
     20190130151000 #timestamp-2
        33   55.3
      R                                         A
      S-4 B                                3.     E     F      3
     -9999 -9999     5.99       0.000 -9999 -9999
     -9999 -9999     7.99       0.000 -9999 -9999
     20190130152000 #timestamp-3
         56   39
      P                                         A
      S-4 B                               4.     E     F      3
     -9999 -9999     6.99       0.000 -9999 -9999
     -9999 -9999     9.99       0.000 -9999 -9999
     20190130152000 #timestamp-3
        33   55.3
      R                                         A
      S-4 B                                3.     E     F      3
     -9999 -9999     5.99       0.000 -9999 -9999
     -9999 -9999     7.99       0.000 -9999 -9999