Search code examples
performancememorygrepdiskhard-drive

How long can I expect grep to take on a 10 TB file?


I have a 10 TB file with words from multiple books, and I'm trying to grep for some uncommon strings (no regex). For example:

grep "cappucino" filename

I'm trying to estimate how long this will take. I'm not really looking for whether it's the right approach or not. I'd like to learn more about what really happens under the hood when I call grep.

Please correct me if I'm wrong:

I use mechanical harddrive with roughly 200 MB/s read speed, so it will take roughly 10 million / 200 = 50000 seconds = 14 hours to finish. Is this an accurate estimate?


Solution

  • The short answer is: no.

    The longer answer is: it depends.

    The even longer answer is: grep's performance depends on a lot of things:

    • are you running a fixed string search (-F, fgrep) or not - grep uses Boyer-Moore algorithm which by itself isn't capable of finding regular expressions so what grep does (or at least used to do) is it first finds a fixed string in your regexp, tries to find it using BM in the text and do a regexp match (not sure about the current implementation whether it uses an NFA or a DFA implementation, probably a hybrid)
    • how long is your pattern - BM works faster for longer patterns
    • how many matches will you have - the less the matches the faster it will be
    • what is your CPU and memory - hard drive will help you only during reading not during computation time
    • what other options are you using with your grep
    • 14 hours might not even be your lower bound because Boyer-Moore is smart enough to compute an offset at which next possible match might occur so it doesn't need to read-in the whole file. This does depend on the implementation though and is just my speculation. After re-running the below test with a much longer pattern I was able to go down to 0.23sec and I don't think my disk is that fast. But there might be some caching involved instead.

    For instance I'm running on a 500MB/s SSD (at least that's what the manufacturer says) and grepping a 200MB file with a very short pattern (few chars) gives me:

    With 808320 hits

    real    0m1.734s
    user    0m1.334s
    sys 0m0.120s
    

    With 0 hits:

    real    0m0.059s
    user    0m0.046s
    sys 0m0.016s
    

    @Edit: in short read about Boyer-Moore :-)

    @Edit2: well to check how grep works you should instead check the source code, I described a very general workflow above.