I have the following two commands:
# 6 million lines
"zgrep -oF -f /projects/lab.mis/index/temp_hg19.inx /projects/incoming/M1_R2.fastq.gz"
# 6 thousand lines
"zgrep -oF -f /projects/lab.mis/index/optimize_hg19.inx /projects/incoming/M1_R2.fastq.gz"
The first searches using a file that has over 6 million patterns, while the second uses one with only 6K.
Both pattern files should contain the fixed string: "GATTCCAGATGGAGGT"
However, only the second command, the one with 6K search terms, returns the match. Is there a reason for this? I do not see any error message, so I am very confused.
[Part of] the string you expect to find may be part of [the concatenation of] other matching strings from the bigger file.
For example:
$ cat file
FOOGATTCCAGATGGAGGTBAR
with a small file of just the 1 string to match:
$ cat strings1
GATTCCAGATGGAGGT
$ grep -Fof strings1 file
GATTCCAGATGGAGGT
and with a larger file that includes that string plus others:
$ cat strings2
GATTCCAGATGGAGGT
FOOG
$ grep -Fof strings2 file
FOOG
Since grep reported a match on FOOG, the G that'd be the start of GATTCCAGATGGAGGT has been consumed by the match on FOOG and so is no longer in the buffer for grep to match; all that's left is ATTCCAGATGGAGGTBAR.
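To see which of the other strings is shadowing the one you expect, one option is to keep only the lines that contain the expected string and then run the -o search on just those lines. This is a minimal sketch, not a definitive tool: the function name shadowing_matches is hypothetical, and it assumes plain-text input as in the example above.

```shell
# Hypothetical helper: show what "grep -o" actually reports on the lines
# that contain the expected string.
#   $1 = expected fixed string
#   $2 = file of search strings (one per line)
#   $3 = file to search
shadowing_matches() {
    # First grep: keep whole lines containing the expected string.
    # Second grep: report the non-overlapping matches found on those lines.
    grep -F -e "$1" -- "$3" | grep -oF -f "$2"
}
```

With the example files, shadowing_matches GATTCCAGATGGAGGT strings2 file prints FOOG, which shows that FOOG is the string that took the leading G. For a gzipped input the first grep would become zgrep.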
If you want to find all matches you can do:
$ awk 'NR==FNR{strs[$0]; next} {for (str in strs) if ( pos=index($0,str) ) print substr($0,pos,length(str))}' strings2 file
GATTCCAGATGGAGGT
FOOG
(really zcat file | awk '...' for your gzipped file of course) but that'd obviously be slower than zgrep as it's doing more work than zgrep.
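Note that index() returns only the first position of a string, so the awk command above reports each string at most once per line. If repeated or overlapping occurrences matter, a loop over each line can report all of them. This is a sketch under the same assumptions as above; the function name find_all_matches is hypothetical, and it will be slower still with millions of strings.

```shell
# Hypothetical helper: print every occurrence (including overlapping ones)
# of every search string found in the input.
#   $1 = file of search strings (one per line)
#   $2 = file to search, or "-" to read standard input
find_all_matches() {
    awk '
        # First file: remember each non-empty search string.
        NR == FNR { if ($0 != "") strs[$0]; next }
        # Second file: scan each line once per search string.
        {
            for (str in strs) {
                start = 1
                # index() on the remainder of the line finds the next hit.
                while ((pos = index(substr($0, start), str)) > 0) {
                    print str
                    start += pos   # resume one character past the hit start
                }
            }
        }
    ' "$1" "$2"
}
```

For the gzipped file this could be called as zcat file | find_all_matches strings2 -. The order in which awk visits the strings is unspecified, so pipe the output through sort if a stable order is needed.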