I am looking for an efficient way to filter two datasets against each other. Basically, I want to keep only the lines whose key columns are present in both files and whose values are not "0/0" in both files.
The input data (for those interested, a genomic VCF file that I have simplified for this question) looks like this:
sample1.dat
1 1001 0/0:8:8:99:PASS
1 1002 0/0:8:8:99:PASS
1 1003 0/1:5,3:8:99:PASS,PASS
2 1234 0/0:8:8:99:PASS # not present in sample2
2 2345 1/1:8:8:99:PASS
sample2.dat
1 1001 0/0:8:8:99:PASS
1 1002 0/1:5,3:8:99:PASS,PASS
1 1003 0/0:8:8:99:PASS
2 2345 1/1:8:8:99:PASS
2 3456 0/1:8:8:99:PASS # not present in sample1
The script would ideally produce the following output:

sample1_out.dat
1 1002 0/0:8:8:99:PASS
1 1003 0/1:5,3:8:99:PASS,PASS
2 2345 1/1:8:8:99:PASS
sample2_out.dat
1 1002 0/1:5,3:8:99:PASS,PASS
1 1003 0/0:8:8:99:PASS
2 2345 1/1:8:8:99:PASS
In this case, 1-1001 is not printed because both files have the value "0/0", and 2-1234 and 2-3456 are not printed because each is present in only one of the two files.
Any help is greatly appreciated!
awk to the rescue! You probably need to split the files first; for each chunk you can do this:
# f prefixes each input line with a synthetic join key built from the first two fields
$ function f { awk -v OFS='\t' '{print $1 "~" $2, $0}' "$1"; }
# keep joined rows where at least one genotype field is not 0/0
$ join <(f file1) <(f file2) |
    awk -v OFS='\t' '$4 !~ /0\/0/ || $7 !~ /0\/0/ {
        print $2, $3, $4 > "file1.out"
        print $5, $6, $7 > "file2.out"
    }'
Explanation: let join do the work of matching the corresponding records. Since join matches on a single field, we first build a synthetic key by merging the first two fields of each file. Note that join expects its inputs sorted on the join field; the sample data already is, but otherwise you would need to sort first. The joined output carries all the information we need: apply your "0/0" logic to the genotype fields ($4 from file1, $7 from file2) and write each half of the record back out to the corresponding output file.
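For reference, a joined record looks like this (the synthetic key, then the three fields from each file), which is why the genotype columns land in $4 and $7:

1~1002 1 1002 0/0:8:8:99:PASS 1 1002 0/1:5,3:8:99:PASS,PASS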
$ head file{1,2}.out
==> file1.out <==
1 1002 0/0:8:8:99:PASS
1 1003 0/1:5,3:8:99:PASS,PASS
2 2345 1/1:8:8:99:PASS
==> file2.out <==
1 1002 0/1:5,3:8:99:PASS,PASS
1 1003 0/0:8:8:99:PASS
2 2345 1/1:8:8:99:PASS
You will probably be better off parameterizing the file names, both input and output, for example as in the sketch below.
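A minimal sketch of such a wrapper, assuming bash; the script name filter_pair.sh and its argument order are made up for illustration, and a sort on the synthetic key is added in case the inputs are not already ordered:

#!/usr/bin/env bash
# filter_pair.sh IN1 IN2 OUT1 OUT2  (hypothetical name and argument order)
# Keep rows whose key appears in both inputs and whose genotype
# field is not 0/0 in both.
set -euo pipefail

in1=$1 in2=$2 out1=$3 out2=$4

# Prefix each line with a synthetic key and sort so join can match records;
# LC_ALL=C keeps sort and join agreeing on collation order.
key() { awk -v OFS='\t' '{print $1 "~" $2, $0}' "$1" | LC_ALL=C sort -k1,1; }

LC_ALL=C join <(key "$in1") <(key "$in2") |
    awk -v OFS='\t' -v o1="$out1" -v o2="$out2" '
        $4 !~ /0\/0/ || $7 !~ /0\/0/ {
            print $2, $3, $4 > o1
            print $5, $6, $7 > o2
        }'

Invoked, for example, as ./filter_pair.sh sample1.dat sample2.dat sample1_out.dat sample2_out.dat.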