I have tab-delimited file and one of the column($10) is separated by commas if more than one value exists.
I have a list of IDs that I want to match with the above file and find which ID is present. I need code that acknowledges the tab separation and comma separation to find the ID and report "ID_present" in the last column delimited by tab
File1
KK2 62652 63579 m64204e - polypeptide 61932 64190 A_KK2_000063400 orthologous_to=KK2_0006707300.1
KK2 62652 63579 m64204e - polypeptide 61932 64190 A_KK2_000063400 orthologous_to=KK2_0006707310.1,RR2_000678900.1
KK2 62652 63579 m64204e - polypeptide 61932 64190 A_KK2_000063400 orthologous_to=KK2_0006707310.1,RR2_000678900.1,RR2_000567908.1
KK2 62652 63579 m64204e - polypeptide 61932 64190 A_KK2_000063400 orthologous_to=KK2_0006707310.1,KK2_0007890345.1
File 2
KK2_0006707310.1
KK2_0006707300.1
Outfile
KK2 62652 63579 m64204e - polypeptide 61932 64190 A_KK2_000063400 orthologous_to=KK2_0006707300.1 ID_Present
KK2 62652 63579 m64204e - polypeptide 61932 64190 A_KK2_000063400 orthologous_to=KK2_0006707310.1,RR2_000678900.1 ID_present
KK2 62652 63579 m64204e - polypeptide 61932 64190 A_KK2_000063400 orthologous_to=KK2_0006707310.1,RR2_000678900.1,RR2_000567908.1 ID_present
Awk_Script
awk 'BEGIN {FS="\t,"} {OFS="\t"} NR==FNR {a[$10]=$1 ; b[$10]=$2; c[$10]=$3; d[$1]=$n; next} {if ($1 in a) {print $0, "ID_present} else {print $0, "#N/A"} }' file2 file1 > outfile
This is probably what you're trying to do, assuming the expected output in your question is wrong:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
tgts[$1]
next
}
{
stat = "N/A"
split($10,vals,/[=,]/)
for ( i=2; i in vals; i++ ) {
val = vals[i]
if ( val in tgts ) {
stat = "ID_present"
break
}
}
print $0, stat
}
$ awk -f tst.awk file2 file1
KK2 62652 63579 m64204e - polypeptide 61932 64190 A_KK2_000063400 orthologous_to=KK2_0006707300.1 ID_present
KK2 62652 63579 m64204e - polypeptide 61932 64190 A_KK2_000063400 orthologous_to=KK2_0006707310.1,RR2_000678900.1 ID_present
KK2 62652 63579 m64204e - polypeptide 61932 64190 A_KK2_000063400 orthologous_to=KK2_0006707310.1,RR2_000678900.1,RR2_000567908.1 ID_present
KK2 62652 63579 m64204e - polypeptide 61932 64190 A_KK2_000063400 orthologous_to=KK2_0006707310.1,KK2_0007890345.1 ID_present