bash, grep, backreference, word-boundary

grep: why must there be word boundaries around backreferences?


I'm just curious why grep matches things in this way.

For example, let's say I'm trying to find a word that occurs twice in a line, both times as a whole word and not as part of another word. So I'm trying to find lines like the following:

hello everybody hello

and not lines like the following:

hello everybody hellopeople 

Then why does the following grep expression work:

grep -E '(\<.*\>).*\<\1\>' file

and not the following:

grep -E '(\<.*\>).*\1' file

I would have thought that the second one would work, because the word boundaries (\< and \>) are inside the parentheses and so should apply to the second match as well, but it doesn't. It seems confusing that one has to put word boundaries around the backreference too. Can someone explain why grep matches lines this way, or elaborate on the idea?


Solution

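  • A zero-width assertion (zero-length match) cannot be captured in a capture group. \b, \< and \> match positions, not characters, so the group stores only the text between them, and \1 matches that text again without re-checking the boundaries. The same is true of zero-width assertions such as lookbehind and lookahead.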

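    For example, this pattern (lookarounds need a PCRE engine, such as GNU grep -P, and are not available in grep -E):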

    ((?<=#)\w+(?=#)).*\1
    

    will match the string

    #hello# everybody hellofoo
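
To check the lookaround example yourself, here is a minimal sketch. It assumes GNU grep built with PCRE support (the -P option); the variable names `line` and `matched` are only for illustration. The -o option prints just the matched part, which makes it visible that \1 matched the bare text "hello" inside "hellofoo", with no "#" required around it.

```shell
# Assumption: GNU grep with PCRE support (-P); other greps may lack it.
# The lookarounds demand "#" on both sides of the captured word, but the
# group stores only the characters "hello", not those conditions.
line='#hello# everybody hellofoo'

# -o prints only the matched text, -P enables lookbehind/lookahead.
matched=$(printf '%s\n' "$line" | grep -oP '((?<=#)\w+(?=#)).*\1')

printf '%s\n' "$matched"
```

The printed match ends at the "hello" inside "hellofoo", so the backreference clearly did not insist on the surrounding "#" characters.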
    

    P.S. You may want to use \w+ instead of .* inside your word boundaries, since .* also matches spaces and can therefore capture more than one word.
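
Putting this together with the lines from the question, the following is a rough sketch rather than a definitive recipe. It assumes GNU grep (for \<, \>, \w and backreferences with -E); the helper `sample` and the variable names are made up for the illustration.

```shell
# Assumption: GNU grep, which supports \<, \>, \w and \1 with -E.
# The two sample lines from the question; only the first one repeats
# "hello" as a whole word.
sample() {
  printf '%s\n' 'hello everybody hello' 'hello everybody hellopeople'
}

# Boundaries inside the group only: \1 replays the text "hello" but not
# the boundary checks, so the "hello" in "hellopeople" satisfies it too.
inside_only=$(sample | grep -E '(\<\w+\>).*\1')

# Boundaries repeated around \1: the second occurrence must also be a
# whole word, so only the first line is selected.
around_backref=$(sample | grep -E '\<(\w+)\>.*\<\1\>')

printf 'boundaries inside the group only:\n%s\n\n' "$inside_only"
printf 'boundaries around the backreference:\n%s\n' "$around_backref"
```

With the boundaries only inside the group, both lines are selected; with the boundaries repeated around \1, only "hello everybody hello" is.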