Tags: r, regex, perl, regex-lookarounds, lookbehind

Using look-ahead/behind on complex search in R/Perl regex


I can't figure out how to use lookaheads/lookbehinds in a regular expression to rule out a pattern that spans two adjacent parts of the word/motif I'm searching for.

In a set of DNA strings, I need to match TGGA + one C or T + 0-4 A/C/T/G + >= 5 C/T, but I don't want a GT anywhere in the match. I've figured out how to eliminate this within the 0-4 A/C/T/G (example #1), but I can't figure out how to deal with cases where the G comes from the [A,C,T,G]{0,4} and the adjacent T comes from the [C,T]{5,}.

I've tried adding a lookbehind by expanding the last part to [C,T](?>!GT)[C,T]{4,}, but neither that nor the lookaround in front of the [A,C,T,G]{0,4} picks up the split GT instance. Any tips/help would be appreciated!

Current regex:

TGGA[C,T](?!GT)[A,C,T,G]{0,4}[C,T]{5,}

Example set:
1) TGGACGTGGTCCCCC (bad, dealt with)
2) TGGACGCCCCC (good)
3) TGGACGGGGTCCCCC... (bad, how do I fix this?)


Solution

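  • Use a negative lookahead after each G in the variable 0-4 stretch to indicate that a T must not follow it. Because the lookahead inspects the next character regardless of which part of the pattern would consume it, it also rejects a G whose adjacent T would come from the final [CT]{5,}. The commas are dropped from the character classes because a comma inside [...] is matched as a literal comma: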

    /TGGA[CT](?:[ACT]|G(?!T)){0,4}[CT]{5,}/
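
As a quick check, the sketch below runs the asker's pattern and the pattern above against the three example sequences. It uses Python's re module purely for illustration (an assumption, since the question is about R/Perl), and the names ORIGINAL, FIXED, examples, and matches are made up for this example. The pattern uses only a non-capturing group and a negative lookahead, so it should behave the same way in R with grepl(pattern, x, perl = TRUE).

```python
import re

# The asker's pattern, kept as written (commas inside [] are matched literally).
ORIGINAL = re.compile(r"TGGA[C,T](?!GT)[A,C,T,G]{0,4}[C,T]{5,}")

# Pattern from the answer: each G in the variable stretch must not be followed by T.
FIXED = re.compile(r"TGGA[CT](?:[ACT]|G(?!T)){0,4}[CT]{5,}")

# The three sequences from the question, with the asker's labels.
# The trailing "..." of example 3 is left out because its content is unknown.
examples = {
    "TGGACGTGGTCCCCC": "bad",
    "TGGACGCCCCC": "good",
    "TGGACGGGGTCCCCC": "bad",
}


def matches(pattern, seq):
    """Return True if the pattern is found anywhere in seq."""
    return pattern.search(seq) is not None


# Map each sequence to (original pattern matched?, fixed pattern matched?).
results = {seq: (matches(ORIGINAL, seq), matches(FIXED, seq)) for seq in examples}

for seq, (orig_hit, fixed_hit) in results.items():
    print(f"{seq:<16} expected={examples[seq]:<4} original={orig_hit} fixed={fixed_hit}")
```

With these inputs, both patterns reject example 1 and accept example 2, while only the original pattern matches example 3: its single lookahead checks for GT at one position only, whereas the fixed pattern checks every G in the variable stretch.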