I want to combine two files if the values from the first column of each file are the same. In case that they are not the same, add values NaN.
The output should have the number of lines from the file that has the maximum number of lines from both files - so, in the case that the first file has 200 line and the second files has 180 lines then the output should have 200 lines, but also if the first file has 180 lines and the second files has 200 lines, the output should still have 200 lines.
Because I don't know which of the files has more lines compared to the other file, the script should verify which of the files has more line for a specific value from column 1 - for example if file 1 has 13 entries regarding value 140 and in the second file I have 12 entries regarding value 140 then the output should have a total of 13 entries in which 12 entries contains values and one entry is with NaN. This should work also if the file two has 13 entries and file one has only 12 entries.
In the case that one of the files is empty, the output should contain the elements from the file that have data and NaN for the other 2 columns from the empty file. In the case that the first column does not match, it should contain the values from the file that have data and NaN for the other 2 column from the file that don't match. The data from the two input files should be sorted numerically based on the first column.
For example, if in the first file I have:
140 12.49
140 12.33
140 10.62
140 8.57
140 10.09
140 9.82
140 11.48
140 13.07
140 14.09
140 15.02
140 14.20
140 15.05
140 16.15
141 14.91
141 13.62
141 13.09
141 11.54
141 9.86
141 8.37
141 7.95
141 7.55
141 7.86
141 7.92
141 9.99
141 10.82
141 9.83
142 9.31
142 9.36
142 8.77
142 9.52
142 9.92
142 9.56
142 10.50
142 12.53
142 12.05
142 11.30
142 11.52
142 12.61
142 15.05
And on the second files I have:
140 12.70
140 12.64
140 10.52
140 8.61
140 10.11
140 9.88
140 11.48
140 13.12
140 14.16
140 15.07
140 14.25
140 15.16
140 16.25
141 14.84
142 9.54
142 8.74
142 9.68
142 10.00
142 9.64
142 10.64
142 12.69
142 12.15
142 11.51
142 11.69
142 12.77
142 15.25
I want the results to be:
140 12.49 140 12.70
140 12.33 140 12.64
140 10.62 140 10.52
140 8.57 140 8.61
140 10.09 140 10.11
140 9.82 140 9.88
140 11.48 140 11.48
140 13.07 140 13.12
140 14.09 140 14.16
140 15.02 140 15.07
140 14.20 140 14.25
140 15.05 140 15.16
140 16.15 140 16.25
141 14.91 141 14.84
141 13.62 141 NaN
141 13.09 141 NaN
141 11.54 141 NaN
141 9.86 141 NaN
141 8.37 141 NaN
141 7.95 141 NaN
141 7.55 141 NaN
141 7.86 141 NaN
141 7.92 141 NaN
141 9.99 141 NaN
141 10.82 141 NaN
141 9.83 141 NaN
142 9.31 142 9.54
142 9.36 142 8.74
142 8.77 142 9.68
142 9.52 142 10.00
142 9.92 142 9.64
142 9.56 142 10.64
142 10.50 142 12.69
142 12.53 142 12.15
142 12.05 142 11.51
142 11.30 142 11.69
142 11.52 142 12.77
142 12.61 142 15.25
142 15.05 142 NaN
In the case that the file 2 had the entries from files 1 that the output should be similar but the column 2 from the output should have NaN - for example:
140 15.16 140 15.05
140 16.25 140 16.15
141 14.84 141 14.91
141 NaN 141 13.62
141 NaN 141 13.09
.....................
142 15.25 142 12.61
142 NaN 142 15.05
awk 'NR==FNR{a[$1]=$2; next} {print $1, ($1 in a) ? a[$1] : "NaN", $1, $2}' file1.txt file2.txt
Thank you kindly for all the recommendations.
I would approach this as a merge join operation:
There are a few ways to do this in awk
awk
(half the max memory)getline
to process both files (minimal memory usage)One awk
idea using getline
to process both files:
$ cat merge_join.awk
BEGIN {
delete col1 # occasionally I run into cases where awk
delete col2 # confuses arrays for scalars; the "delete"
delete rc # serves to type these variables as arrays
while (1) {
if (! (1 in col1) ) { rc[1]=getline < f1 # if no entries in array for index "1" then read new line from file1 and ...
col1[1]=$1; col2[1]=$2 # store our columns in our arrays
}
if (! (2 in col1) ) { rc[2]=getline < f2 # if no entries in array for index "2" then read new line from file2 and ...
col1[2]=$1; col2[2]=$2 # store our columns in our arrays
}
if ( ! rc[1] && ! rc[2] ) break # if no more rows in both files then break out of the while loop
if ( ! rc[1] ) { col1[1]=col1[2]; col2[1]="NaN" } # if no more rows in file1 then default the col1[1] and col2[1] entries
if ( ! rc[2] ) { col1[2]=col1[1]; col2[2]="NaN" } # if no more rows in file2 then default the col1[2] and col2[2] entries
if ( col1[1] == col1[2] ) { print col1[1], col2[1], col1[2], col2[2]
delete col1; delete col2 # delete both col1[] and col2[] entries
}
else
if ( col1[1] < col1[2] ) { print col1[1], col2[1], col1[1], "NaN"
delete col1[1] # delete just the f1/file1 entry
}
else { print col1[2], "NaN" , col1[2], col2[2]
delete col1[2] # delete just the f2/file2 entry
}
} # while
} # BEGIN
NOTES: as currently coded the script will continue to run getline < fN
even after all rows have been read from fN
, though the return code (stored in rc[N]
will flag the operation as a failure); I've opted to continue operations based on the return codes (rc[N]
) instead of halting the associated getline < fN
Test data:
$ cat file1
130 130.130
130 131.132
140 140.140
140 141.141
140 142.142
$ cat file2
140 149.149
140 148.148
150 150.159
150 150.158
Running the script against the two test files:
$ awk -f merge_join.awk -v f1=file1 -v f2=file2
130 130.130 130 NaN
130 131.132 130 NaN
140 140.140 140 149.149
140 141.141 140 148.148
140 142.142 140 NaN
150 NaN 150 150.159
150 NaN 150 150.158