Tags: linux, bash, grep, internals

Grep internal working principle


I want to know how grep works internally. Specifically, is finding the first match significantly faster than finding all matches? For example, suppose the first match occurs 10% of the way into the file, while matches overall are spread across the whole file. In that case, finding only the first match should let grep process much less of the file than finding all matches (grep must traverse the whole file in the latter case, versus only 10% of it in the former). I want to know whether my assumption is correct, because this possible improvement could vastly speed up my processing work. Thanks.


Solution

  • If you're using grep to print all matching lines from the file, then of course it has to process the entire file.

    On the other hand, if you use grep -q to produce a successful termination status if at least one match is found, then of course grep can stop at the first match. If the first match is found early in the file, that saves time because grep can immediately exit at that point and return a successful termination status. If no match occurs in the file (the worst case), then it has to process the entire file. It must process the entire file in that case: otherwise, how could it be sure there is no match? If a match occurs only in the very last line, but grep never reads that line, it will wrongly report that there is no match.
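    A quick demonstration of that behavior (assuming GNU or BSD grep and a POSIX shell; the sample file and its contents are made up for illustration):

    ```shell
    # Create a small sample file.
    printf 'alpha\nbeta\ngamma\n' > /tmp/sample.txt

    # -q: quiet mode. grep exits with status 0 as soon as the first
    # match is found, printing nothing; it does not read further.
    if grep -q 'beta' /tmp/sample.txt; then
        echo "found"
    fi

    # If the pattern occurs nowhere, grep must read the whole file
    # before it can exit with status 1.
    grep -q 'delta' /tmp/sample.txt || echo "not found"
    ```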

    Grep compiles its pattern into a regular-expression matcher. How a regular expression is structured has performance implications: some regular expressions perform better than others, and depending on the matching algorithm used, a regular expression that appears small can generate a state machine with a large number of states.
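    A related, simpler observation: if your pattern is a plain literal string, you can skip the regex machinery entirely with grep -F, which can be faster on large inputs (timings are machine-dependent and omitted here; the sample file is made up):

    ```shell
    printf 'foo\nfoobar\nbaz\n' > /tmp/words.txt

    # The literal string searched as a regular expression:
    grep -c 'foobar' /tmp/words.txt     # counts matching lines -> 1

    # The same string with -F (fixed-string matching, no regex
    # compilation at all), giving the same result:
    grep -cF 'foobar' /tmp/words.txt    # -> 1
    ```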

    A technique to speed up searching is indexing. If you often look for specific words in a corpus of text, it is faster if you have an index of the words indicating the locations where each one is found in the corpus. The index is organized so that the list of locations for a word can be retrieved very quickly, without scanning the text. It takes time to build the index (requiring a complete scan of the entire body of text), and the index has to be rebuilt whenever the corpus changes.
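    A minimal sketch of the idea, building a word-to-line-number index with awk (the corpus and file names are hypothetical; real indexers use far more sophisticated data structures than a flat sorted file):

    ```shell
    printf 'the cat sat\nthe dog ran\ncat and dog\n' > /tmp/corpus.txt

    # Build the index once: one "word line-number" pair per occurrence,
    # deduplicated and sorted for fast lookup.
    awk '{ for (i = 1; i <= NF; i++) print $i, NR }' /tmp/corpus.txt \
        | sort -u > /tmp/index.txt

    # A lookup now searches the (typically much smaller) index instead
    # of rescanning the corpus itself.
    grep '^cat ' /tmp/index.txt     # prints: cat 1 / cat 3
    ```

    On a sorted index file, the lookup could even use binary search (e.g. look(1) where available) instead of a linear grep.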

    This is the basis for tools that speed up identifier searching over program source code, such as GNU Id-Utils. And of course, indexing is the basis for World Wide Web search engines like Google.