Bash/Awk: Find common translocations in two files using overlapping coordinates


I would like to compare two files to identify common translocations. However, the same translocation does not have exactly the same coordinates in both files, so I want to check whether it occurs between the same pair of chromosomes (chr1, chr2) and whether the coordinates overlap.

Here is an example with two files:

file_1.txt:

chr1 min1 max1 chr2 min2 max2
1 111111 222222 2 333333 444444
2 777777 888888 3 555555 666666
15 10 100 15 2000 2100
17 500 530 18 700 750
20 123456 234567 20 345678 456789

file_2.txt:

chr1 min1 max1 chr2 min2 max2
1 100000 200000 2 400000 500000
2 800000 900000 3 500000 600000
15 200 300 15 2000 3000
20 150000 200000 20 300000 500000

The pair chr1 and chr2 must be the same in file 1 and file 2. The interval from min1 to max1 must then overlap between the two files, and likewise for the interval from min2 to max2.

For the result, perhaps the best solution is to print the two lines as follows:

1   111111  222222  2   333333  444444
1   100000  200000  2   400000  500000

2   777777  888888  3   555555  666666
2   800000  900000  3   500000  600000

20  123456  234567  20  345678  456789
20  150000  200000  20  300000  500000

(For this simplified example, I tried to represent the different types of overlap I could encounter. I hope it is clear enough).

Thank you for your help.


Solution

  • awk to the rescue!

    $ awk 'function overlap(x1,y1,x2,y2) {return y1>x2 && y2>x1}   # true if [x1,y1] and [x2,y2] intersect
                 {k=$1 FS $4}                                        # key: the chromosome pair
         NR==FNR {r[k]=$0; c1min[k]=$2; c1max[k]=$3; c2min[k]=$5; c2max[k]=$6; next}
         overlap(c1min[k],c1max[k],$2,$3) &&
         overlap(c2min[k],c2max[k],$5,$6) {print r[k] ORS $0 ORS}' file_1.txt file_2.txt
    
    1 111111 222222 2 333333 444444
    1 100000 200000 2 400000 500000
    
    2 777777 888888 3 555555 666666
    2 800000 900000 3 500000 600000
    
    20 123456 234567 20 345678 456789
    20 150000 200000 20 300000 500000
    

    This assumes the first file can be held in memory, and it prints an extra empty line at the end.
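
One limitation worth noting: the one-liner stores a single line of file 1 per chromosome pair, so if file 1 lists several translocations between the same two chromosomes, only the last one read is compared. Below is a minimal sketch of a variant that keeps all of them. It assumes both files start with a header line (as in the example) and uses the same strict overlap test; the function name `find_common` and the array names are hypothetical, not part of the original answer.

```shell
# Sketch only: find_common is a hypothetical wrapper, not from the original answer.
# Usage: find_common FILE1 FILE2
find_common() {
    awk 'function overlap(x1, y1, x2, y2) { return y1 > x2 && y2 > x1 }
         FNR == 1 { next }                     # assumption: line 1 of each file is a header
                  { k = $1 FS $4 }             # key: the chromosome pair
         NR == FNR { n[k]++; rec[k, n[k]] = $0; next }   # keep every file 1 line per pair
         {
             for (i = 1; i <= n[k]; i++) {
                 split(rec[k, i], f)           # f[2],f[3] and f[5],f[6] are the stored ranges
                 if (overlap(f[2] + 0, f[3] + 0, $2 + 0, $3 + 0) &&
                     overlap(f[5] + 0, f[6] + 0, $5 + 0, $6 + 0))
                     print rec[k, i] ORS $0 ORS
             }
         }' "$1" "$2"
}
```

Called as `find_common file_1.txt file_2.txt` on the sample data, this should print the same three pairs as above. Like the original, it holds the first file in memory and leaves a trailing empty line.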