Search code examples
joinawk

Combine two files if the values from the first column of each file are the same


I want to combine two files if the values from the first column of each file are the same. In case that they are not the same, add values NaN. The output should have the number of lines from the file that has the maximum number of lines from both files - so, in the case that the first file has 200 line and the second files has 180 lines then the output should have 200 lines, but also if the first file has 180 lines and the second files has 200 lines, the output should still have 200 lines. Because I don't know which of the files has more lines compared to the other file, the script should verify which of the files has more line for a specific value from column 1 - for example if file 1 has 13 entries regarding value 140 and in the second file I have 12 entries regarding value 140 then the output should have a total of 13 entries in which 12 entries contains values and one entry is with NaN. This should work also if the file two has 13 entries and file one has only 12 entries.
In the case that one of the files is empty, the output should contain the elements from the file that have data and NaN for the other 2 columns from the empty file. In the case that the first column does not match, it should contain the values from the file that have data and NaN for the other 2 column from the file that don't match. The data from the two input files should be sorted numerically based on the first column.

For example, if in the first file I have:

140 12.49
140 12.33
140 10.62
140 8.57
140 10.09
140 9.82
140 11.48
140 13.07
140 14.09
140 15.02
140 14.20
140 15.05
140 16.15
141 14.91
141 13.62
141 13.09
141 11.54
141 9.86
141 8.37
141 7.95
141 7.55
141 7.86
141 7.92
141 9.99
141 10.82
141 9.83
142 9.31
142 9.36
142 8.77
142 9.52
142 9.92
142 9.56
142 10.50
142 12.53
142 12.05
142 11.30
142 11.52
142 12.61
142 15.05

And on the second files I have:

140 12.70
140 12.64
140 10.52
140 8.61
140 10.11
140 9.88
140 11.48
140 13.12
140 14.16
140 15.07
140 14.25
140 15.16
140 16.25
141 14.84
142 9.54
142 8.74
142 9.68
142 10.00
142 9.64
142 10.64
142 12.69
142 12.15
142 11.51
142 11.69
142 12.77
142 15.25 

I want the results to be:

140 12.49   140 12.70
140 12.33   140 12.64
140 10.62   140 10.52
140 8.57    140 8.61
140 10.09   140 10.11
140 9.82    140 9.88
140 11.48   140 11.48
140 13.07   140 13.12
140 14.09   140 14.16
140 15.02   140 15.07
140 14.20   140 14.25
140 15.05   140 15.16
140 16.15   140 16.25
141 14.91   141 14.84
141 13.62   141 NaN
141 13.09   141 NaN
141 11.54   141 NaN
141 9.86    141 NaN
141 8.37    141 NaN
141 7.95    141 NaN
141 7.55    141 NaN
141 7.86    141 NaN
141 7.92    141 NaN
141 9.99    141 NaN
141 10.82   141 NaN
141 9.83    141 NaN
142 9.31    142 9.54
142 9.36    142 8.74
142 8.77    142 9.68
142 9.52    142 10.00
142 9.92    142 9.64
142 9.56    142 10.64
142 10.50   142 12.69
142 12.53   142 12.15
142 12.05   142 11.51
142 11.30   142 11.69
142 11.52   142 12.77
142 12.61   142 15.25
142 15.05   142 NaN

In the case that the file 2 had the entries from files 1 that the output should be similar but the column 2 from the output should have NaN - for example:

140 15.16 140 15.05 140 16.25 140 16.15
141 14.84 141 14.91
141 NaN 141 13.62
141 NaN 141 13.09
..................... 142 15.25 142 12.61
142 NaN 142 15.05

awk 'NR==FNR{a[$1]=$2; next} {print $1, ($1 in a) ? a[$1] : "NaN", $1, $2}' file1.txt file2.txt

Thank you kindly for all the recommendations.


Solution

  • I would approach this as a merge join operation:

    1. both files must be sorted by the join column (in this case both files are sorted by the first column)
    2. read one row from each file
    3. compare the 1st columns
    4. if 1st columns match then print both rows to stdout and go to step 2
    5. if 1st column from file1 is lower then print file1's row + file1's 1st column + "NaN", read a new row from file1, go to step 3
    6. if 1st column from file2 is lower then print file2's row + file2's 1st column + "NaN", read a new row from file2, go to step 3
    7. continue until there are no more rows to read from both files

    There are a few ways to do this in awk

    • read both files into memory (aka arrays) then process the arrays (max memory usage)
    • read first file into memory and feed 2nd file to awk (half the max memory)
    • read first file into memory and use getline to read the 2nd file (half the max memory usage)
    • use getline to process both files (minimal memory usage)

    One awk idea using getline to process both files:

    $ cat merge_join.awk
    BEGIN {
    delete col1                                                              # occasionally I run into cases where awk
    delete col2                                                              # confuses arrays for scalars; the "delete"
    delete rc                                                                # serves to type these variables as arrays
    while (1) {
       if (! (1 in col1) )       { rc[1]=getline < f1                        # if no entries in array for index "1" then read new line from file1 and ...
                                   col1[1]=$1; col2[1]=$2                    # store our columns in our arrays
                                 }
       if (! (2 in col1) )       { rc[2]=getline < f2                        # if no entries in array for index "2" then read new line from file2 and ...
                                   col1[2]=$1; col2[2]=$2                    # store our columns in our arrays
                                 }
    
       if ( ! rc[1] && ! rc[2] ) break                                       # if no more rows in both files then break out of the while loop
       if ( ! rc[1]            ) { col1[1]=col1[2]; col2[1]="NaN" }          # if no more rows in file1 then default the col1[1] and col2[1] entries
       if (            ! rc[2] ) { col1[2]=col1[1]; col2[2]="NaN" }          # if no more rows in file2 then default the col1[2] and col2[2] entries
    
       if ( col1[1] == col1[2] ) { print col1[1], col2[1], col1[2], col2[2]
                                   delete col1; delete col2                  # delete both col1[] and col2[] entries
                                 }
       else
       if ( col1[1] <  col1[2] ) { print col1[1], col2[1], col1[1],   "NaN"
                                   delete col1[1]                            # delete just the f1/file1 entry
                                 }
       else                      { print col1[2], "NaN"  , col1[2], col2[2]
                                   delete col1[2]                            # delete just the f2/file2 entry
                                 }
      }  # while
    }    # BEGIN
    

    NOTES: as currently coded the script will continue to run getline < fN even after all rows have been read from fN, though the return code (stored in rc[N] will flag the operation as a failure); I've opted to continue operations based on the return codes (rc[N]) instead of halting the associated getline < fN

    Test data:

    $ cat file1
    130 130.130
    130 131.132
    140 140.140
    140 141.141
    140 142.142
    
    $ cat file2
    140 149.149
    140 148.148
    150 150.159
    150 150.158
    

    Running the script against the two test files:

    $ awk -f merge_join.awk -v f1=file1 -v f2=file2
    130 130.130 130 NaN
    130 131.132 130 NaN
    140 140.140 140 149.149
    140 141.141 140 148.148
    140 142.142 140 NaN
    150 NaN 150 150.159
    150 NaN 150 150.158