Search code examples
awksamtools

Filter a file to find rows that match in one column but differ in another column


I would like to filter a file so that I can obtain rows that match in column 1 and do not match in column 2. In the following example:

00b27c71-a833-4605-9fb3-a2714ac98092    ENST00000352983.6   157 60  16
00d77e65-466e-4fe6-ad0f-bc6b3f44af75    ENST00000367142.4   130 12  4
00d77e65-466e-4fe6-ad0f-bc6b3f44af75    ENST00000367142.4   8   60 0
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16
00d77e65-466e-4fe6-ad0f-bc6b3f44af74    ENST00000367142.5   130 12  4
00d77e65-466e-4fe6-ad0f-bc6b3f44af74    ENST00000367142.7   8   60 0
00d77e65-466e-4fe6-ad0f-bc6b3f44af74    ENST00000258424.2   8   60 0

I would like to find entires in column 1 that appear exactly twice, and that do NOT match in column 2, i.e. duplicates in the combiation column1,column2 should be ignored. So the expected output would be:

00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16

What is in columns 3,4,5,etc is not important for filtering, but I do need to retain the information.

I also need to pipe this in from another output that is necessary to read the file and retain the header. So I need something in the format:

samtools view -h file.bam | code that I need > results.bam

I have tried several version of awk, but to no avail. Any help would be much appreciated.


Solution

  • I believe what you are after is the following:

    awk '!($1 FS $2 in a) { b[$1]++; a[$1 FS $2]=$0 }
         END { for(i in a) {$0=i; if (b[$1]==2) print a[i] } }' file
    

    This outputs :

    00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16
    00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048
    

    What it essentially does is it checks if the combination $1 FS $2 is in the array a. If it is not, keep track of the count of $1 in b[$1] and store the full line in a[$1 FS $2]. In the end, print a[i] if the count is correct. Remark that the key of b is obtained by reassigning the key i to $0. This redefines the fields $1 and $2 and $1 is the key you wanted.

    note: the above script does not necessarily keep track of the order as array traversal is done in an unspecified order. If you want to keep the order, you need to keep track of the line-index:

    awk '!($1 FS $2 in a) { b[$1]++; a[$1 FS $2]=$0; c[NR]=$1 FS $2 }
         END { for(i=1;i<=NR;++i) if(i in c) { $0=c[i]; if (b[$1]==2) print a[$0]}
         }' file
    

    outputing:

    00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048
    00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16
    

    old answer:

    awk '!($1 in a) { a[$1]=$2; b[$1]=$0; next }
         !match($2,a[$1]){a[$1]=a[$1] FS $2; b[$1]=b[$1] ORS $0}
         END { for (i in a) if (gsub(FS,FS,a[i]) == 1) print b[i] }' file
    

    This outputs :

    00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048
    00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16
    

    What it essentially does is keep track of two arrays (a and b) both indexed by the first column. If, array a does not contain the element of column $2 then it adds it to the string a[$1]. It also stores the full line in b[$1] separated by ORS. In the end, we count how many fields are in a[i], if it is two, print b[i].