I would like to filter a file so that I can obtain rows that match in column 1 and do not match in column 2. In the following example:
00b27c71-a833-4605-9fb3-a2714ac98092 ENST00000352983.6 157 60 16
00d77e65-466e-4fe6-ad0f-bc6b3f44af75 ENST00000367142.4 130 12 4
00d77e65-466e-4fe6-ad0f-bc6b3f44af75 ENST00000367142.4 8 60 0
00b27c71-a833-4605-9fb3-a2714ac98091 ENST00000258424.2 12 60 2048
00b27c71-a833-4605-9fb3-a2714ac98091 ENST00000352983.6 157 60 16
00d77e65-466e-4fe6-ad0f-bc6b3f44af74 ENST00000367142.5 130 12 4
00d77e65-466e-4fe6-ad0f-bc6b3f44af74 ENST00000367142.7 8 60 0
00d77e65-466e-4fe6-ad0f-bc6b3f44af74 ENST00000258424.2 8 60 0
I would like to find entires in column 1 that appear exactly twice, and that do NOT match in column 2, i.e. duplicates in the combiation column1,column2 should be ignored. So the expected output would be:
00b27c71-a833-4605-9fb3-a2714ac98091 ENST00000258424.2 12 60 2048
00b27c71-a833-4605-9fb3-a2714ac98091 ENST00000352983.6 157 60 16
What is in columns 3,4,5,etc is not important for filtering, but I do need to retain the information.
I also need to pipe this in from another output that is necessary to read the file and retain the header. So I need something in the format:
samtools view -h file.bam | code that I need > results.bam
I have tried several version of awk, but to no avail. Any help would be much appreciated.
I believe what you are after is the following:
awk '!($1 FS $2 in a) { b[$1]++; a[$1 FS $2]=$0 }
END { for(i in a) {$0=i; if (b[$1]==2) print a[i] } }' file
This outputs :
00b27c71-a833-4605-9fb3-a2714ac98091 ENST00000352983.6 157 60 16
00b27c71-a833-4605-9fb3-a2714ac98091 ENST00000258424.2 12 60 2048
What it essentially does is it checks if the combination $1 FS $2
is in the array a
. If it is not, keep track of the count of $1
in b[$1]
and store the full line in a[$1 FS $2]
. In the end, print a[i]
if the count is correct. Remark that the key of b
is obtained by reassigning the key i
to $0
. This redefines the fields $1
and $2
and $1
is the key you wanted.
note: the above script does not necessarily keep track of the order as array traversal is done in an unspecified order. If you want to keep the order, you need to keep track of the line-index:
awk '!($1 FS $2 in a) { b[$1]++; a[$1 FS $2]=$0; c[NR]=$1 FS $2 }
END { for(i=1;i<=NR;++i) if(i in c) { $0=c[i]; if (b[$1]==2) print a[$0]}
}' file
outputing:
00b27c71-a833-4605-9fb3-a2714ac98091 ENST00000258424.2 12 60 2048
00b27c71-a833-4605-9fb3-a2714ac98091 ENST00000352983.6 157 60 16
old answer:
awk '!($1 in a) { a[$1]=$2; b[$1]=$0; next }
!match($2,a[$1]){a[$1]=a[$1] FS $2; b[$1]=b[$1] ORS $0}
END { for (i in a) if (gsub(FS,FS,a[i]) == 1) print b[i] }' file
This outputs :
00b27c71-a833-4605-9fb3-a2714ac98091 ENST00000258424.2 12 60 2048
00b27c71-a833-4605-9fb3-a2714ac98091 ENST00000352983.6 157 60 16
What it essentially does is keep track of two arrays (a
and b
) both indexed by the first column. If, array a
does not contain the element of column $2
then it adds it to the string a[$1]
. It also stores the full line in b[$1]
separated by ORS
.
In the end, we count how many fields are in a[i]
, if it is two, print b[i]
.