I'm just curious why grep matches things in this way.
For example, let's say I'm trying to find a word that occurs twice in a sentence (as a whole word, not as part of another word). So I'm trying to find lines like the following:
hello everybody hello
and not like the following:
hello everybody hellopeople
Then why does the following grep expression work:
grep -E '(\<.*\>).*\<\1\>' file
and not the following:
grep -E '(\<.*\>).*\1' file
I would have thought that the second one would work, because the word boundaries (\< and \>) are inside the parentheses that the back reference refers to, but it doesn't. It seems rather confusing that one has to put word boundaries around the back reference. Can someone explain why grep matches lines in this way, or elaborate on this idea further?
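A zero-width assertion (a zero-length match) cannot be captured in a capture group. \b, \< and \> are zero-length matches, so the group stores only the text between them, and the back reference \1 repeats that text without the boundaries. The same is true of zero-width assertions such as lookbehind and lookahead.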
For example, in PCRE syntax (e.g. grep -P):
((?<=#)\w+(?=#)).*\1
will match the string
#hello# everybody hellofoo
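To see this outside grep, here is a minimal sketch in Python's `re` module. It is only an illustration: `\b` is used as a stand-in for grep's `\<` and `\>` (they are not identical), and the variable names are made up for the example.

```python
import re

# The lookaround pattern from above: '#' is required on both sides of the
# word, but lookbehind/lookahead are zero-width, so group 1 holds no '#'.
lookaround = re.compile(r'((?<=#)\w+(?=#)).*\1')
m = lookaround.search('#hello# everybody hellofoo')
captured = m.group(1)
print(repr(captured))  # 'hello'

# Same idea with word boundaries (\b standing in for \< and \>).
unanchored = re.compile(r'(\b\w+\b).*\1')      # boundaries only inside the group
anchored = re.compile(r'(\b\w+\b).*\b\1\b')    # boundaries repeated around \1

partial = 'hello everybody hellopeople'
print(bool(unanchored.search(partial)))                  # True: \1 is just "hello"
print(bool(anchored.search(partial)))                    # False
print(bool(anchored.search('hello everybody hello')))    # True
```

The back reference only replays the captured characters, so any boundary condition has to be written again around `\1`.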
P.S. You may want to use \w+ instead of .* inside your word boundaries.
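A small sketch of why, again in Python with `\b` as an assumed stand-in for `\<` and `\>`: `.*` is free to span several words between two boundaries, while `\w+` can only ever capture a single word.

```python
import re

line = 'hello everybody hello'

# Greedy .* runs to the last word boundary it can reach on the line.
greedy = re.match(r'(\b.*\b)', line).group(1)

# \w+ stops at the end of the first word.
single = re.match(r'(\b\w+\b)', line).group(1)

print(repr(greedy))  # 'hello everybody hello'
print(repr(single))  # 'hello'
```

With `.*` the engine has to backtrack through these longer candidates before it reaches a single repeated word; `\w+` states the intent directly.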