Tags: perl, unix, awk, bioinformatics, vcf-variant-call-format

Find matching key in two large sorted text files and compare values (VCF-files)


I am looking for an efficient solution to filter two datasets. Basically, I want to keep only lines whose key columns are present in both files and that do not have the value "0/0" in both files.

The input data (for those interested, a genomic VCF file that I have simplified for this question) has the following characteristics:

  • Columns 1 and 2 together form a numerically sorted, unique identifier
  • Column 3 starts with values 0/0, 0/1 or 1/1

The script would ideally do the following:

  1. Iterate over every line in sample1.dat and look up the same identifier in sample2.dat
  2. If an identifier from sample1.dat is not found in sample2.dat, do nothing
  3. If both lines contain "0/0", do nothing
  4. If one or both lines do not contain "0/0", write both lines to their respective output files.

INPUT

sample1.dat

1      1001  0/0:8:8:99:PASS
1      1002  0/0:8:8:99:PASS
1      1003  0/1:5,3:8:99:PASS,PASS
2      1234  0/0:8:8:99:PASS # not present in sample2
2      2345  1/1:8:8:99:PASS

sample2.dat

1      1001   0/0:8:8:99:PASS
1      1002   0/1:5,3:8:99:PASS,PASS
1      1003   0/0:8:8:99:PASS
2      2345   1/1:8:8:99:PASS
2      3456   0/1:8:8:99:PASS # not present in sample1

OUTPUT

sample1_out.dat

1      1002  0/0:8:8:99:PASS
1      1003  0/1:5,3:8:99:PASS,PASS
2      2345  1/1:8:8:99:PASS

sample2_out.dat

1      1002   0/1:5,3:8:99:PASS,PASS
1      1003   0/0:8:8:99:PASS
2      2345   1/1:8:8:99:PASS

In this case, 1-1001 is not printed because both lines have the value "0/0", and 2-1234 and 2-3456 are not printed because each is present in only one file.

Some notes:

  • The files are about 260GB, but I can easily split them into multiple files of at most 18GB (basically, I split them by chromosome)
  • Memory available on my machine is about 128GB
  • Columns 1 and 2 together are already sorted in numerical order

Any help is greatly appreciated!


Solution

  • awk to the rescue! You probably need to split the files first (a minimal sketch of that step follows the example output below); for each chunk you can do this:

    $ f() { awk -v OFS='\t' '{print $1"~"$2, $0}' "$1"; }
    $ join <(f file1) <(f file2) |
        awk -v OFS='\t' '$4 !~ /0\/0/ || $7 !~ /0\/0/ {
                           print $2, $3, $4 > "file1.out"
                           print $5, $6, $7 > "file2.out"
                         }'
    

    Explanation: let join do the work of matching the corresponding records; to make that possible, a synthetic key is created by merging the first two fields. The joined output carries all the information we need, so the "0/0" logic is applied to the corresponding fields and the relevant sections of each record are written to the corresponding output files. One caveat: join expects its input to be sorted lexically on the key. That happens to hold for the sample data, but it can break when positions differ in digit count (999 sorts after 1001 lexically), so in general you may want to sort the keyed streams first (see the parameterized sketch at the end).
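
    To see which fields the filter tests, it can help to inspect the joined stream directly. With the sample files above saved as file1 and file2, the first matched records look like this (using join's default single-space output), so fields 2-4 come from file1 and fields 5-7 from file2:

    $ join <(f file1) <(f file2) | head -2
    1~1001 1 1001 0/0:8:8:99:PASS 1 1001 0/0:8:8:99:PASS
    1~1002 1 1002 0/0:8:8:99:PASS 1 1002 0/1:5,3:8:99:PASS,PASS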

    $ head file{1,2}.out                          
    
    ==> file1.out <==                                                                                                     
    1       1002    0/0:8:8:99:PASS
    1       1003    0/1:5,3:8:99:PASS,PASS
    2       2345    1/1:8:8:99:PASS
    
    ==> file2.out <==
    1       1002    0/1:5,3:8:99:PASS,PASS
    1       1003    0/0:8:8:99:PASS
    2       2345    1/1:8:8:99:PASS
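
    For the splitting step mentioned at the top of this answer, awk can do the per-chromosome split in a single pass. A minimal sketch, assuming column 1 holds the chromosome (the sample1.chr*.dat naming here is only an illustration):

    $ awk '{print > ("sample1.chr" $1 ".dat")}' sample1.dat
    $ awk '{print > ("sample2.chr" $1 ".dat")}' sample2.dat

    Because the input is already sorted on column 1, each chunk is written contiguously, and with a few dozen chromosomes the number of open output files stays well within awk's limits.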
    

    You will probably be better off parameterizing the file names, both in and out.
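
    For example, a parameterized wrapper could look like the sketch below. The script name, the keyed helper, and the explicit sort are assumptions on my part rather than part of the pipeline above; the sort guarantees the lexical key order that join expects, at the price of the output following key order rather than the original numeric order.

    #!/usr/bin/env bash
    # filter_pair.sh -- hypothetical wrapper, not part of the original one-liner.
    # Usage: filter_pair.sh IN1 IN2 OUT1 OUT2
    set -euo pipefail

    in1=$1 in2=$2 out1=$3 out2=$4

    # Prefix every line with a synthetic "col1~col2" key and sort lexically on it,
    # so join's ordering requirement holds even when positions differ in length.
    # (sort may spill to temporary files on disk for very large chunks.)
    keyed() { awk -v OFS='\t' '{print $1"~"$2, $0}' "$1" | sort -k1,1; }

    join <(keyed "$in1") <(keyed "$in2") |
        awk -v OFS='\t' -v o1="$out1" -v o2="$out2" '
            $4 !~ /0\/0/ || $7 !~ /0\/0/ {
                print $2, $3, $4 > o1
                print $5, $6, $7 > o2
            }'

    Run it once per chunk, e.g. ./filter_pair.sh sample1.chr1.dat sample2.chr1.dat sample1_out.chr1.dat sample2_out.chr1.dat (the chunk names follow the illustrative split above).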