
Grep multiple positions with/without ID


I want to grep a VCF file for multiple positions. The following works:

grep -f template_gb37 file.vcf > gb37_result

My template_gb37 has 10000 lines and it looks like this:

1   1156131 rs2887286   C   T
1   1211292 rs6685064   T   C
1   2283896 rs2840528   A   G

When the VCF has the rs ID, it works perfectly.

The problem is that the VCF I am going to grep may not have the rs ID and may have "." instead:

File.vcf

#CHROM  POS  ID  REF  ALT ....
1   1156131 .   C   T  ....
1   1211292 .   T   C  ....
1   1211292 .   T   C  ....

Is there a way to search for my multiple patterns so that the ID column matches either the rs ID or just "."?

Thanks in advance


Solution

  • I think you mean the third field (ID) in your file could be . or rsNNNNNN and you want to allow either. So, I think you need an "alternation", which you do with a | like this:

    printf "cat\nmonkey\ndog" | grep -E "cat|dog"
    cat
    dog
    

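    So your pattern file "template_gb37" needs to look like this. The alternation sits inside the parentheses so that it applies only to the ID field; written as (\.)|rs2887286, the | would split the whole line into two separate patterns:

    1   1156131 (\.|rs2887286)   C   T
    1   1211292 (\.|rs6685064)   T   C
    1   2283896 (\.|rs2840528)   A   G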
    

    And you need to search with:

    grep -Ef PATTERNFILE file.vcf
    

    If you don't want to change your pattern file, you can edit it "on-the-fly" each time you use it. So, if "template" currently looks like this:

    1   1156131 rs2887286   C   T
    1   1211292 rs6685064   T   C
    1   2283896 rs2840528   A   G
    

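    the following awk will edit it, wrapping the alternation in parentheses so that it applies only to the ID field. OFS is set to a tab because VCF columns are tab-separated, and awk would otherwise rejoin the fields with single spaces:

    awk -v OFS='\t' '{$3 = "(\\.|" $3 ")"}1' template
    

    to make it this (columns separated by tabs):

    1   1156131 (\.|rs2887286)   C   T
    1   1211292 (\.|rs6685064)   T   C
    1   2283896 (\.|rs2840528)   A   G
    

    which means you could use my whole answer like this:

    grep -Ef <( awk -v OFS='\t' '{$3 = "(\\.|" $3 ")"}1' template ) file.vcf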
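
If stray matches are a concern, the generated patterns can also be anchored. The following is only a sketch under a few assumptions: columns are separated by whitespace, each record has a single ALT allele, and the template has exactly the five columns shown in the question. The `workdir` directory, the `patterns` file and the toy data are made up for illustration. Each pattern starts with `^` so that chromosome 1 does not also match chromosome 11, and ends with "whitespace or end of line" so that ALT `C` does not also match `CA`.

```shell
# Toy inputs in a scratch directory (hypothetical data, for illustration only).
workdir=$(mktemp -d)
printf '1 1156131 rs2887286 C T\n1 1211292 rs6685064 T C\n' > "$workdir/template"
printf '#CHROM\tPOS\tID\tREF\tALT\n1\t1156131\t.\tC\tT\n11\t1156131\t.\tC\tT\n1\t1211292\trs6685064\tT\tC\n1\t1211292\t.\tT\tCA\n' > "$workdir/file.vcf"

# One anchored extended regex per template line:
#   ^CHROM <ws> POS <ws> (\.|rsID) <ws> REF <ws> ALT (<ws> or end of line)
awk '{
    printf "^%s[[:space:]]+%s[[:space:]]+(\\.|%s)[[:space:]]+%s[[:space:]]+%s([[:space:]]|$)\n", $1, $2, $3, $4, $5
}' "$workdir/template" > "$workdir/patterns"

# Search the VCF with the generated pattern file.
hits=$(grep -Ef "$workdir/patterns" "$workdir/file.vcf")
printf '%s\n' "$hits"

rm -rf "$workdir"
```

With the toy data above, only the chromosome 1 record with ID "." and alleles C/T, and the rs6685064 record, are printed; the chromosome 11 record and the one with ALT `CA` are skipped.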