Search code examples
bashperlreplaceuniqueidentifiercol

bash merging tables on unique id


I have two similar, 'table format' text files, each several million records long. In the inputfile1, the unique identifier is a merger of values in two other columns (neither of which are unique identifiers on their own). In inputfile2, the unique identifier is two letters followed by a random four-digit number.

How can I replace the unique identifiers in inputfile1 with the corresponding unique identifiers in the inputfile2? All of the records in the first table are present in the second, though not vis versa. Below are toy examples of the files.

Input file 1:

Grp Len ident   data
A   20  A_20    3k3bj52
A   102 A_102   3k32rf2
A   352 A_352   3w3bj52
B   60  B_60    3k3qwrg
B   42  B_42    3kerj52
C   89  C_89    3kftj55
C   445 C_445   fy5763b

Input file 2:

Grp Len ident
A   20  fz2525
A   102 fz5367
A   352 fz4678
A   356 fz1543
B   60  fz5732
B   11  fz2121
B   42  fz3563
C   89  fz8744
C   245 fz2653
C   445 fz2985
C   536 fz8983

Desired output:

Grp Len ident   data
A   20  fz2525  3k3bj52
A   102 fz5367  3k32rf2
A   352 fz4678  3w3bj52
B   60  fz5732  3k3qwrg
B   42  fz3563  3kerj52
C   89  fz8744  3kftj55
C   445 fz2985  fy5763b

My provisional plan is:

  1. Generate extra identifiers for input2, in the style of input1 (easy)
  2. Filter out lines from input2 that don't occur input1 (hardish)
  3. Then stick on the data from input1 (easy)

I might be able to do this in R but the data is large and complex, and I was wondering if there was a way in bash or perl. Any tips in the right direction would be good.


Solution

  • This should work for you, assuming the Grp and Len values are in the same order in both files, as per my comment

    Essentially it reads a line from the first file and then reads from the second file, forming the Grp_Len key from each record until it finds an entry that matches. Then it's just a matter of building the new output record

    use strict;
    use warnings;
    
    open my $f1, '<', 'file1.txt';
    print scalar <$f1>;
    open my $f2, '<', 'file2.txt';
    <$f2>;
    
    while ( <$f1> ) {
    
        my @f1 = split;
    
        my @f2;
        while () {
            @f2 = split ' ', <$f2>;
            last if join('_', @f2[0,1]) eq $f1[2];
        }
    
        print "@f2 $f1[3]\n";
    }
    

    output

    Grp Len ident   data
    A 20 fz2525 3k3bj52
    A 102 fz5367 3k32rf2
    A 352 fz4678 3w3bj52
    B 60 fz5732 3k3qwrg
    B 42 fz3563 3kerj52
    C 89 fz8744 3kftj55
    C 445 fz2985 fy5763b
    



    Update

    Here's another version which is identical except that it builds a printf format string from the spacing of the column headers in the first file. That results in a much neater output

    use strict;
    use warnings;
    
    open my $f1, '<', 'file1.txt';
    my $head = <$f1>;
    print $head;
    my $format = create_format($head);
    
    open my $f2, '<', 'file2.txt';
    <$f2>;
    
    while ( <$f1> ) {
    
        my @f1 = split;
    
        my @f2;
        while () {
            @f2 = split ' ', <$f2>;
            last if join('_', @f2[0,1]) eq $f1[2];
        }
    
        printf $format, @f2, $f1[3];
    }
    
    sub create_format {
        my ($head) = @_;
        my ($format, $pos);
    
        while ( $head =~ /\b\S/g ) {
            $format .= sprintf("%%-%ds", $-[0] - $pos) if defined $pos;
            $pos = $-[0];
        }
    
        $format . "%s\n";
    }
    

    output

    Grp Len ident   data
    A   20  fz2525  3k3bj52
    A   102 fz5367  3k32rf2
    A   352 fz4678  3w3bj52
    B   60  fz5732  3k3qwrg
    B   42  fz3563  3kerj52
    C   89  fz8744  3kftj55
    C   445 fz2985  fy5763b