Search code examples
bashawksedlarge-data

Conversion table replace all elements in other file


I am trying to convert all ICD codes in a tab separated file to Phecodes (based on a ICD-Phecode conversion table tab separated file) for a biology bioinformatics project. I found a good starting point with the code from the below stackoverflow post:

awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] }1' TABLE OLD_FILE

Replacing values in large table using conversion table

But I don't want "all values in the first column have been changed according to the conversion table" (above code) I want all values in all columns in 002.txt to be changed according to the conversion table ICD9toPhecode.txt and ICD10toPhecode.txt. So I changed the awk script to the below but it's not working it's not doing anything:

awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { for (i = 1; i <= $NR; ++i) $i=a[$1] }1' ICD9toPhecode.txt 002.txt
awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { for (i = 1; i <= $NR; ++i) $i=a[$1] }1' ICD10toPhecode.txt 002.txt

The first column in ICD9toPhecode.txt and ICD10toPhecode.txt is the ICD9 or ICD10 code and the second column is the Phecode.

Every column in 002.txt is a ICD9 or ICD10 code.

EDIT: It's still not working How do I write to the file?

Here is the anonymized patient data 002.txt sample of ICD10 codes which is OLD_FILE

1   2   3   4   5   6   7   8
K40.9   K43.9   N20.0   N20.1   N23 N39.0   R69 Z88.1
B96.8   D12.6   E11.6   E87.6   I44.7   K40.9   K43.9   K52.9
NOT

Here is the conversion table (ICD10toPhecode.txt) or TABLE

icd10cm phecode
K40.9   550.1
K43.9   550.5
N20.0   594.1
N20.1   594.3
N23 594.8
N39.0   591
R69 1019
Z88.1   960.1
B96.8   041
D12.6   208
E11.6   250.2
E87.6   276.14
I44.7   426.32
K40.9   550.1
K43.9   550.5
K52.9   558
XNO    17

This is what I should get (ICD10 codes converted to Phecodes) (002_output.txt):

1   2   3   4   5   6   7   8
550.1   550.5   594.1   594.3   594.8   591 1019    960.1
041 208 250.2   276.14  426.32  550.1   550.5   558

But what I actually get in 002_output.txt is a repeat of 002.txt

What I need to know is how to change:

awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { for (i = 1; i <= $NR; ++i) $i=a[$1] }1' ICD9toPhecode.txt 002.txt
awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { for (i = 1; i <= $NR; ++i) $i=a[$1] }1' ICD10toPhecode.txt 002.txt

Specifically change ICD10toPhecode.txt 002.txt

I need to write the output to 002_output.txt. It can't be as simple as

ICD10toPhecode.txt 002.txt > 002_output.txt

that outputs the same thing as 002.txt

TESTABLE TEST CASE (for tables see the code snippets I posted above with those names):

awk '
   # Ignore header
   NR==1{ next }
   # Load first file
   FNR==NR { a[$1]=$2; next }
   {
      # Foreach value
      for (i = 1; i <= $NR; ++i) {
          # if the value is in second file
          if ($i in a) {         
                # then replace it
                $i = a[$i]       # NOTE - $i __not__ $1 !
          }
      }
      # print it!
      print
   }
' ICD10toPhecode.txt 002.txt > 002_output.txt

BASED ON:

awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] }1' TABLE OLD_FILE

I'm pretty sure in my TESTABLE TEST CASE I messed up my for loop maybe FNR==NR { a[$1]=$2; next } I need to link $1 ICD codes with $2 Phecodes in ICD10toPhecode.txt and replace the ICD codes with Phecodes in all fields in 002.txt (MORE THAN ONE COLUMN)


Solution

  • The bugs I see in your code are using $NR instead of NF in your loop, skipping the first line of the 2nd file instead of printing it as-is, and not using tabs as the in/out separators. This is apparently what you need:

    $ awk '
        BEGIN { FS=OFS="\t" }
        NR==FNR { map[$1]=$2; next }
        FNR>1 {
            for (i=1; i<=NF; i++) {
                if ($i in map) {
                    $i = map[$i]
                }
            }
        }
        { print }
    ' ICD10toPhecode.txt 002_ICD.txt
    1       2       3       4       5       6       7       8
    550.1   550.5   594.1   594.3   594.8   591     1019    960.1
    041     208     250.2   276.14  426.32  550.1   550.5   558