Search code examples
grep

How grep selects a pattern for a line if several patterns match one line at once, and they intersect


Consider this example. I write grep -e aboba -e ba test/test4.txt. In this file I have only one line: aboba aboba aboba aboba aboba baboba. I thought grep traversed all regular expressions in a specific order. Those. For each line, it first considers the first regular expression, then the next one, and so on. enter image description here

However, in this example my logic broke down. grep will recognize the first 5 words as aboba, and in the last word it will find two occurrences of ba. Why is that? It seems to me that the result should be either aboba 6 times, or ba in the first 5 words, and ba 2 times in the last word.


Solution

  • In aboba the first letter of the word starts a match on one of the patterns(aboba). The parser tracks that match through each letter, registers it, then moves on to the next character until it fails, or confirms a complete match - in this case, aboba matches.

    In baboba the first letter of the word starts a match on one of the patterns(ba). The parser tracks that match through each letter, registers it, then moves on to the next character until it fails, or confirms a complete match - in this case, the match was ba.

    Having completed a match that can't be part of a longer match it starts over at the next character, and the next character is b, which might be a new match(of ba), so it checks o and fails. AFAIK, backtracking won't re-enter a marked-as-finished pattern without lookbehind, so it continues to the next character, another b - the first character of another possible ba match.

    It continues to find a a, completing a match for another ba, registers that, and continues to the next character.

    This process effectively consumed the first a of the possible aboba in a successful match of ba, so most implementations never bother to check for the possibility of overlapping patterns without explicit lookahead/lookbehind.

    (The amount of backtracking that might trigger could exponentially increase the work it would have to do, as well as radically complicating the CLI options needed to establish which behavior the user wanted, not to mention that a large number of patterns - you can load big files of possible matches to be searched - might leave the program backtracking and testing one record till the heat death of the universe, never mind the possible millions of records grep is commonly asked to scan...)

    So as far as I know, no overlaps without lookahead or lookbehind, or custom implementations. c.f. this question for a couple of examples.

    Consider the way your given patterns interact, and look at these examples:

    $: grep -o -e ababa -e ba <<< abababababaabababa
    ababa
    ba
    ba
    ba
    ababa
    ba
    
    $: grep -o -e xxx -e xxxx <<< xxxxxxxxxxxxxxx
    xxxx
    xxxx
    xxxx
    xxx