Search code examples
bashunixawksedlarge-data

Replacing values in large table using conversion table


I am trying to replace values in a large space-delimited text-file and could not find a suitable answer for this specific problem:

Say I have a file "OLD_FILE", containing a header and approximately 2 million rows:

COL1 COL2 COL3 COL4 COL5
rs10 7 92221824 C A 
rs1000000 12 125456933 G A 
rs10000010 4 21227772 T C 
rs10000012 4 1347325 G C 
rs10000013 4 36901464 C A 
rs10000017 4 84997149 T C 
rs1000002 3 185118462 T C 
rs10000023 4 95952929 T G 
...

I want to replace the first value of each row with a corresponding value, using a large (2.8M rows) conversion table. In this conversion table, the first column lists the value I want to have replaced, and the second column lists the corresponding new values:

COL1_b36       COL2_b37
rs10    7_92383888
rs1000000       12_126890980
rs10000010      4_21618674
rs10000012      4_1357325
rs10000013      4_37225069
rs10000017      4_84778125
rs1000002       3_183635768
rs10000023      4_95733906
...

The desired output would be a file where all values in the first column have been changed according to the conversion table:

COL1 COL2 COL3 COL4 COL5
7_92383888 7 92221824 C A 
12_126890980 12 125456933 G A 
4_21618674 4 21227772 T C 
4_1357325 4 1347325 G C 
4_37225069 4 36901464 C A 
4_84778125 4 84997149 T C 
3_183635768 3 185118462 T C 
4_95733906 4 95952929 T G 
...

Additional info:

  • Performance is an issue (the following command takes approximately a year:

    while read a b; do sed -i "s/\b$a\b/$b/g" OLD_FILE ; done < CONVERSION_TABLE

  • A complete match is necessary before replacing
  • Not every value in the OLD_FILE can be found in the conversion table...
  • ...but every value that could be replaced, can be found in the conversion table.

Any help is very much appreciated.


Solution

  • Here's one way using awk:

    awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] }1' TABLE OLD_FILE
    

    Results:

    COL1 COL2 COL3 COL4 COL5
    7_92383888 7 92221824 C A
    12_126890980 12 125456933 G A
    4_21618674 4 21227772 T C
    4_1357325 4 1347325 G C
    4_37225069 4 36901464 C A
    4_84778125 4 84997149 T C
    3_183635768 3 185118462 T C
    4_95733906 4 95952929 T G
    

    Explanation, in order of appearance:

    NR==1 { next }            # simply skip processing the first line (header) of
                              # the first file in the arguments list (TABLE)
    
    FNR==NR { ... }           # This is a construct that only returns true for the
                              # first file in the arguments list (TABLE)
    
    a[$1]=$2                  # So when we loop through the TABLE file, we add the
                              # column one to an associative array, and we assign
                              # this key the value of column two
    
    next                      # This simply skips processing the remainder of the
                              # code by forcing awk to read the next line of input
    
    $1 in a { ... }           # Now when awk has finished processing the TABLE file,
                              # it will begin reading the second file in the
                              # arguments list which is OLD_FILE. So this construct
                              # is a condition that returns true literally if column
                              # one exists in the array
    
    $1=a[$1]                  # re-assign column one's value to be the value held
                              # in the array
    
    1                         # The 1 on the end simply enables default printing. It
                              # would be like saying: $1 in a { $1=a[$1]; print $0 }'