
Does zgrep limit the number of search patterns?


I have the following two commands:

# 6 million lines
"zgrep -oF -f /projects/lab.mis/index/temp_hg19.inx /projects/incoming/M1_R2.fastq.gz"
# 6 thousand lines 
"zgrep -oF -f /projects/lab.mis/index/optimize_hg19.inx /projects/incoming/M1_R2.fastq.gz"

The first searches using a file that has over 6 million patterns, while the second uses one with only 6K.

Both files should contain the fixed string: "GATTCCAGATGGAGGT"

However, only the second command, the one with 6K search terms, returns the match. Is there a reason for this? I do not see any error message, so I am very confused.


Solution

  • [Part of] the string you expect to find may be part of [the concatenation of] other matching strings from the bigger file.

    For example:

    $ cat file
    FOOGATTCCAGATGGAGGTBAR
    

    with a small file of just the 1 string to match:

    $ cat strings1
    GATTCCAGATGGAGGT
    
    $ grep -Fof strings1 file
    GATTCCAGATGGAGGT
    

    and with a larger file that includes that string plus others:

    $ cat strings2
    GATTCCAGATGGAGGT
    FOOG
    
    $ grep -Fof strings2 file
    FOOG
    

    Since grep reported a match on FOOG, the G that would be the start of GATTCCAGATGGAGGT has already been consumed by that match and is no longer in the buffer for grep to match. All that's left is ATTCCAGATGGAGGTBAR.

    If you want to report every string that occurs in a line, even when the strings overlap, you can do:

    $ awk 'NR==FNR{strs[$0]; next} {for (str in strs) if ( pos=index($0,str) ) print substr($0,pos,length(str))}' strings2 file
    GATTCCAGATGGAGGT
    FOOG
    

    (For your gzipped file that'd really be zcat file | awk '...', of course.) It will be slower than zgrep, though, since it tests every string against every line and so does more work than zgrep.
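
As a rough sketch of the explanation above, the script below reproduces the effect with throwaway files, then lists every occurrence of every string, including overlapping and repeated ones (the one-liner above prints only the first occurrence of each string per line). The temporary directory, the variable names grep_out and all_out, and the final sort are illustrative choices, not part of the original commands. It assumes a POSIX-style shell, a grep that supports -o, an awk with index() and substr(), and the mktemp utility.

```shell
# Build a throwaway copy of the example files (paths are illustrative).
tmp=$(mktemp -d)
printf 'FOOGATTCCAGATGGAGGTBAR\n' > "$tmp/file"
printf 'GATTCCAGATGGAGGT\nFOOG\n' > "$tmp/strings2"

# grep -o prints non-overlapping matches only, so the earlier match on
# FOOG hides the longer string that starts at its final G.
grep_out=$(grep -Fof "$tmp/strings2" "$tmp/file")

# Print one line per occurrence of each string, overlapping or repeated.
all_out=$(awk '
    NR == FNR { if ($0 != "") strs[$0]; next }   # load strings, skip empty lines
    {
        for (str in strs) {
            start = 1
            # index() is relative to the substring, so advancing start by pos
            # resumes the search one character past the previous hit.
            while ((pos = index(substr($0, start), str)) > 0) {
                print str
                start += pos
            }
        }
    }
' "$tmp/strings2" "$tmp/file" | sort)

printf 'grep -o found:\n%s\n\nawk found:\n%s\n' "$grep_out" "$all_out"
rm -rf "$tmp"
```

The sort is there only because awk's for (str in strs) visits strings in an unspecified order. For a gzipped input, the same awk program should be able to read from a pipe by passing - as the second file argument (zcat file | awk '...' strings2 -). With millions of strings this loop will be very slow, so it is probably best used to confirm a suspected overlap on a small sample rather than on the full data.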