Search code examples
awkvlookup

awk script to match value in one of the csv column in a tab-delimited file


I have tab-delimited file and one of the column($10) is separated by commas if more than one value exists.

I have a list of IDs that I want to match with the above file and find which ID is present. I need code that acknowledges the tab separation and comma separation to find the ID and report "ID_present" in the last column delimited by tab

File1

KK2        62652   63579   m64204e     -       polypeptide     61932   64190   A_KK2_000063400 orthologous_to=KK2_0006707300.1     
KK2        62652   63579   m64204e     -       polypeptide     61932   64190   A_KK2_000063400 orthologous_to=KK2_0006707310.1,RR2_000678900.1    
KK2        62652   63579   m64204e     -       polypeptide     61932   64190   A_KK2_000063400 orthologous_to=KK2_0006707310.1,RR2_000678900.1,RR2_000567908.1    
KK2        62652   63579   m64204e     -       polypeptide     61932   64190   A_KK2_000063400 orthologous_to=KK2_0006707310.1,KK2_0007890345.1    

File 2

KK2_0006707310.1
KK2_0006707300.1

Outfile

KK2        62652   63579   m64204e     -       polypeptide     61932   64190   A_KK2_000063400 orthologous_to=KK2_0006707300.1     ID_Present
KK2        62652   63579   m64204e     -       polypeptide     61932   64190   A_KK2_000063400 orthologous_to=KK2_0006707310.1,RR2_000678900.1    ID_present
KK2        62652   63579   m64204e     -       polypeptide     61932   64190   A_KK2_000063400 orthologous_to=KK2_0006707310.1,RR2_000678900.1,RR2_000567908.1    ID_present

Awk_Script

awk 'BEGIN {FS="\t,"} {OFS="\t"} NR==FNR {a[$10]=$1 ; b[$10]=$2; c[$10]=$3; d[$1]=$n; next} {if ($1 in a) {print $0, "ID_present} else {print $0, "#N/A"} }' file2 file1 > outfile

Solution

  • This is probably what you're trying to do, assuming the expected output in your question is wrong:

    $ cat tst.awk
    BEGIN { FS=OFS="\t" }
    NR==FNR {
        tgts[$1]
        next
    }
    {
        stat = "N/A"
        split($10,vals,/[=,]/)
        for ( i=2; i in vals; i++ ) {
            val = vals[i]
            if ( val in tgts ) {
                stat = "ID_present"
                break
            }
        }
        print $0, stat
    }
    

    $ awk -f tst.awk file2 file1
    KK2     62652   63579   m64204e -       polypeptide     61932   64190   A_KK2_000063400 orthologous_to=KK2_0006707300.1 ID_present
    KK2     62652   63579   m64204e -       polypeptide     61932   64190   A_KK2_000063400 orthologous_to=KK2_0006707310.1,RR2_000678900.1 ID_present
    KK2     62652   63579   m64204e -       polypeptide     61932   64190   A_KK2_000063400 orthologous_to=KK2_0006707310.1,RR2_000678900.1,RR2_000567908.1 ID_present
    KK2     62652   63579   m64204e -       polypeptide     61932   64190   A_KK2_000063400 orthologous_to=KK2_0006707310.1,KK2_0007890345.1        ID_present