I'm trying to pipe a string into a grep/perl regex to pull out overlapping matches. So far, every attempt returns only sequential, non-overlapping matches, with no "lookback":
Attempt using egrep (both on GNU and BSD):
$ echo "bob mary mike bill kim jim john" | egrep -io "[a-z]+ [a-z]+"
bob mary
mike bill
kim jim
Attempt using perl style grep (-P):
$ echo "bob mary mike bill kim jim john" | grep -oP "()[a-z]+ [a-z]+"
bob mary
mike bill
kim jim
Attempt using awk showing only the first match:
$ echo "bob mary mike bill kim jim john" | awk 'match($0, /[a-z]+ [a-z]+/) {print substr($0, RSTART, RLENGTH)}'
bob mary
The overlapping results I'd like to see from a simple working bash pipe command are:
bob mary
mary mike
mike bill
bill kim
kim jim
jim john
Any ideas?
Lookahead is your friend here:
echo "bob mary mike bill kim jim john" |
perl -wnE'say "$1 $2" while /(\w+)\s+(?=(\w+))/g'
The point is that lookahead, as a "zero-width assertion," doesn't consume anything -- while it still allows us to capture a pattern in it.
So as the regex engine matches a word and spaces ((\w+)\s+), gobbling them up, it then stops there and "looks ahead," merely to "assert" that the sought pattern is there; it doesn't move from its spot between the last space and the next \w, doesn't "consume" that next word, as they say.
It is nice, though, that we can also capture the pattern that is "seen," even though it's not consumed! So we get our $1 and $2, two words.
Then, because of the /g modifier, the engine moves on to find another word plus spaces, with yet another word following. That next word is the one our lookahead spotted -- so now that one is consumed, and the one after it is "looked" for (and captured). And so on.
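If this is needed more than once, the one-liner can be wrapped in a shell function and dropped into any pipe. This is only a convenience sketch: the name overlapping_pairs is invented here, and it again assumes perl 5.10 or later is available.

```shell
# Hypothetical wrapper (the name is an assumption, not part of the answer).
# Reads lines on stdin and prints each pair of adjacent words, one pair per line.
overlapping_pairs() {
  perl -wnE 'say "$1 $2" while /(\w+)\s+(?=(\w+))/g'
}

echo "bob mary mike bill kim jim john" | overlapping_pairs
```

Because -n processes input line by line, pairs are formed only within a line; the last word of one line is not paired with the first word of the next.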