Search code examples
awkcompareuniq

Compare two files with same ID, subtract and add to end of line


I have 2 files with the format of

File A

01  20200111  28.56  22.07  40.14  49.79  22.81  49.31  33.75  31.24  39.41  36.18
02  20200118  32.41  14.89  38.82  60.54  11.54  49.10  34.34  25.53  36.96  34.30
03  20200125  21.95  18.48  28.45  42.94  22.45  42.40  36.43  34.53  43.51  33.28
...
01  20210109  29.32  24.60  34.41  46.56  29.38  44.06  34.47  33.75  41.12  36.04
02  20210116  29.38  28.39  33.07  42.63  29.46  39.41  32.45  33.60  43.81  34.93
03  20210123  27.51  19.55  32.98  45.88  26.05  46.88  37.58  31.98  44.52  35.90


File B

01  19912020  24.20  16.70  31.28  45.98  19.26  42.57  33.43  29.35  39.96  32.22
02  19912020  24.29  15.46  29.02  44.47  19.24  40.57  32.98  29.80  40.26  31.36
03  19912020  20.69  14.07  29.90  45.89  21.34  44.71  35.90  31.61  42.00  33.08

I would like to take the difference of Columns 3-12 of each id (Column 1) in File A from the same unique id in File B and add it at the end of each line in File A. File A has multiple instances of the same id. So each instance would be compared to the same id in File B.

My desired output is:

01   20200111    28.56  22.07  40.14  49.79  22.81  49.31  33.75  31.24  39.41  36.18  4.36    5.37    8.86    3.81    3.55    6.74    0.32    1.89   -0.55    3.96
02   20200118    32.41  14.89  38.82  60.54  11.54  49.10  34.34  25.53  36.96  34.30  8.12   -0.57     9.8   16.07    -7.7    8.53    1.36   -4.27    -3.3    2.94
03   20200125    21.95  18.48  28.45  42.94  22.45  42.40  36.43  34.53  43.51  33.28  1.26    4.41   -1.45   -2.95    1.11   -2.31    0.53    2.92    1.51     0.2
...
01   20210109    29.32  24.60  34.41  46.56  29.38  44.06  34.47  33.75  41.12  36.04  5.12     7.9    3.13    0.58   10.12    1.49    1.04     4.4    1.16    3.82
02   20210116    24.29  15.46  29.02  44.47  19.24  40.57  32.98  29.80  40.26  31.36  5.09   12.93    4.05   -1.84   10.22   -1.16   -0.53     3.8    3.55    3.57
03   20210123    20.69  14.07  29.90  45.89  21.34  44.71  35.90  31.61  42.00  33.08  6.82    5.48    3.08   -0.01    4.71    2.17    1.68    0.37    2.52    2.82

The above output (cols 3-22) needs to be in the following format (%8.2f)


XX   20230708  74.05  68.12  76.00  81.01  63.48  81.06  73.59  67.78  74.86  73.59   5.60  -1.14   1.22   2.50  -4.27   0.37   0.03   5.25   1.74   0.61

I attempted to use something like the below (I edited the code provided by markp-fuso(thanks!)) but the output appears to be alternating b/t File A and File B for each ID.

awk '
FNR==NR { for (i=3;i<=NF;i++)                    # 1st file: loop through 3rd-12th fields
              b[$1][i]=$i                        # store field values in 2-dimensional array
          next                                   # skip to next input line
        }
$1 in b { printf "%s %s%s", $1, OFS, $2           # 2nd file: if $1 is index in first dimension of array b[] then printf first 2 fields and then ...
          for (i=3;i<=NF;i++)                    # loop through 3rd-12th fields
              printf "%s%8.2f%8.2f", OFS, $i,$i-b[$1][i]    # print the difference
          print ""                               # terminate the current line of output
        }

Any help is appreciated.


Solution

  • Assumptions/understandings based on comments from OP's previous question as well as measurements of the desired output:

    • 1st column will have a max width of 2 characters
    • we are to maintain the leading 0 of the 1st column
    • we need to expand the amount of space between the 1st and 2nd columns from 2x spaces to 3x spaces
    • we need to reformat all numeric output columns to %8.2f (even though inputs and desired outputs appear to be formatted to %7.2f; output diffs also vary between .2f and .1f)
    • we need to append all of the diffs to the end of the line

    Modifying OP's current awk attempt:

    awk '
    FNR==NR { for (i=3;i<=NF;i++)
                  b[$1][i]=$i
              next
            }
    $1 in b { diffs = ""                                       # initialize our string of diffs
              printf "%2s  %s", $1, $2                         # hardcode the 2x spaces; no need to reference OFS
              for (i=3;i<=NF;i++) {                            # loop through 3rd-12th columns
                  printf "%8.2f", $i                           # print current column
                  diffs = diffs sprintf("%8.2f", $i-b[$1][i])  # append new diff to "diffs" string
              }
              print diffs                                      # print "diffs" string and terminate current line of output
            }
    ' File_B File_A
    

    This generates:

    01  20200111   28.56   22.07   40.14   49.79   22.81   49.31   33.75   31.24   39.41   36.18    4.36    5.37    8.86    3.81    3.55    6.74    0.32    1.89   -0.55    3.96
    02  20200118   32.41   14.89   38.82   60.54   11.54   49.10   34.34   25.53   36.96   34.30    8.12   -0.57    9.80   16.07   -7.70    8.53    1.36   -4.27   -3.30    2.94
    03  20200125   21.95   18.48   28.45   42.94   22.45   42.40   36.43   34.53   43.51   33.28    1.26    4.41   -1.45   -2.95    1.11   -2.31    0.53    2.92    1.51    0.20
    01  20210109   29.32   24.60   34.41   46.56   29.38   44.06   34.47   33.75   41.12   36.04    5.12    7.90    3.13    0.58   10.12    1.49    1.04    4.40    1.16    3.82
    02  20210116   29.38   28.39   33.07   42.63   29.46   39.41   32.45   33.60   43.81   34.93    5.09   12.93    4.05   -1.84   10.22   -1.16   -0.53    3.80    3.55    3.57
    03  20210123   27.51   19.55   32.98   45.88   26.05   46.88   37.58   31.98   44.52   35.90    6.82    5.48    3.08   -0.01    4.71    2.17    1.68    0.37    2.52    2.82