I have two similar, 'table format' text files, each several million records long. In the inputfile1, the unique identifier is a merger of values in two other columns (neither of which are unique identifiers on their own). In inputfile2, the unique identifier is two letters followed by a random four-digit number.
How can I replace the unique identifiers in inputfile1 with the corresponding unique identifiers in the inputfile2? All of the records in the first table are present in the second, though not vis versa. Below are toy examples of the files.
Input file 1:
Grp Len ident data
A 20 A_20 3k3bj52
A 102 A_102 3k32rf2
A 352 A_352 3w3bj52
B 60 B_60 3k3qwrg
B 42 B_42 3kerj52
C 89 C_89 3kftj55
C 445 C_445 fy5763b
Input file 2:
Grp Len ident
A 20 fz2525
A 102 fz5367
A 352 fz4678
A 356 fz1543
B 60 fz5732
B 11 fz2121
B 42 fz3563
C 89 fz8744
C 245 fz2653
C 445 fz2985
C 536 fz8983
Desired output:
Grp Len ident data
A 20 fz2525 3k3bj52
A 102 fz5367 3k32rf2
A 352 fz4678 3w3bj52
B 60 fz5732 3k3qwrg
B 42 fz3563 3kerj52
C 89 fz8744 3kftj55
C 445 fz2985 fy5763b
My provisional plan is:
I might be able to do this in R but the data is large and complex, and I was wondering if there was a way in bash or perl. Any tips in the right direction would be good.
This should work for you, assuming the Grp
and Len
values are in the same order in both files, as per my comment
Essentially it reads a line from the first file and then reads from the second file, forming the Grp_Len
key from each record until it finds an entry that matches. Then it's just a matter of building the new output record
use strict;
use warnings;
open my $f1, '<', 'file1.txt';
print scalar <$f1>;
open my $f2, '<', 'file2.txt';
<$f2>;
while ( <$f1> ) {
my @f1 = split;
my @f2;
while () {
@f2 = split ' ', <$f2>;
last if join('_', @f2[0,1]) eq $f1[2];
}
print "@f2 $f1[3]\n";
}
Grp Len ident data
A 20 fz2525 3k3bj52
A 102 fz5367 3k32rf2
A 352 fz4678 3w3bj52
B 60 fz5732 3k3qwrg
B 42 fz3563 3kerj52
C 89 fz8744 3kftj55
C 445 fz2985 fy5763b
Here's another version which is identical except that it builds a printf
format string from the spacing of the column headers in the first file. That results in a much neater output
use strict;
use warnings;
open my $f1, '<', 'file1.txt';
my $head = <$f1>;
print $head;
my $format = create_format($head);
open my $f2, '<', 'file2.txt';
<$f2>;
while ( <$f1> ) {
my @f1 = split;
my @f2;
while () {
@f2 = split ' ', <$f2>;
last if join('_', @f2[0,1]) eq $f1[2];
}
printf $format, @f2, $f1[3];
}
sub create_format {
my ($head) = @_;
my ($format, $pos);
while ( $head =~ /\b\S/g ) {
$format .= sprintf("%%-%ds", $-[0] - $pos) if defined $pos;
$pos = $-[0];
}
$format . "%s\n";
}
Grp Len ident data
A 20 fz2525 3k3bj52
A 102 fz5367 3k32rf2
A 352 fz4678 3w3bj52
B 60 fz5732 3k3qwrg
B 42 fz3563 3kerj52
C 89 fz8744 3kftj55
C 445 fz2985 fy5763b