I have a 10 TB file with words from multiple books, and I'm trying to grep for some uncommon strings (no regex). For example:
grep "cappucino" filename
I'm trying to estimate how long this will take. I'm not really looking for whether it's the right approach or not. I'd like to learn more about what really happens under the hood when I call grep.
Please correct me if I'm wrong:
I use mechanical harddrive with roughly 200 MB/s read speed, so it will take roughly 10 million / 200 = 50000 seconds = 14 hours to finish. Is this an accurate estimate?
The short answer is: no.
The longer answer is: it depends.
The even longer answer is: grep's performance depends on a lot of things:
For instance I'm running on a 500MB/s SSD (at least that's what the manufacturer says) and grepping a 200MB file with a very short pattern (few chars) gives me:
With 808320
hits
real 0m1.734s
user 0m1.334s
sys 0m0.120s
With 0
hits:
real 0m0.059s
user 0m0.046s
sys 0m0.016s
@Edit: in short read about Boyer-Moore :-)
@Edit2: well to check how grep works you should instead check the source code, I described a very general workflow above.