regex perl awk grep overlapping

How to get overlapping regex matches with grep/perl/awk


I'm trying to pipe a string into a grep/perl regex to pull out overlapping matches. Currently, the results contain only consecutive, non-overlapping matches, without any "lookback":

Attempt using egrep (both on GNU and BSD):

$ echo "bob mary mike bill kim jim john" | egrep -io "[a-z]+ [a-z]+"
bob mary
mike bill
kim jim

Attempt using perl style grep (-P):

$ echo "bob mary mike bill kim jim john" | grep -oP "()[a-z]+ [a-z]+"
bob mary
mike bill
kim jim

Attempt using awk showing only the first match:

$ echo "bob mary mike bill kim jim john" | awk 'match($0, /[a-z]+ [a-z]+/) {print substr($0, RSTART, RLENGTH)}'
bob mary

The overlapping results I'd like to see from a simple working bash pipe command are:

bob mary
mary mike
mike bill
bill kim
kim jim
jim john

Any ideas?


Solution

  • Lookahead is your friend here

    echo "bob mary mike bill kim jim john" | 
        perl -wnE'say "$1 $2" while /(\w+)\s+(?=(\w+))/g'
    

    The point is that lookahead, as a "zero-width assertion," doesn't consume anything -- while it still allows us to capture a pattern in it.

    So as the regex engine matches a word and spaces ((\w+)\s+), gobbling them up, it then stops there and "looks ahead," merely to "assert" that the sought pattern is there; it doesn't move from its spot between the last space and the next \w, doesn't "consume" that next word, as they say.

    It is nice though that we can also capture that pattern that is "seen," even though it's not consumed! So we get our $1 and $2, two words.

    Then, because of the /g modifier, the engine moves on to find another word+spaces, with yet another word following. That next word is the one our lookahead spotted -- so now that one is consumed, and yet another one is "looked" for (and captured). And so on.

    See Lookahead and lookbehind assertions in perlretut