I am facing a problem with regex usage. I am using the following regex:
\\S*the[^o\\s]*(?<!theo)\\b
The sentence that I am using is:
If the world says that theo is not oreo cookies then thetatheoder theotatheder thetatheder is extratheaterly good.
What I want from the output is the following matches: the, then, thetatheder, extratheaterly.
In short, I want to match 'the' (or 'The') as a whole word or as part of a word, as long as that word does not contain 'theo'.
How can I modify my regex to achieve this? I was thinking of using alternation (the pipe operator) or the question mark, but neither seems to work.
You might use \S in a negative lookbehind as a start boundary, and a negative lookahead to make sure the word does not contain theo.
To match The or the, you could make the pattern case insensitive.
(?<!\S)(?!\S*theo\S*)\S*the\S*
In parts
(?<!\S)
Negative lookbehind, assert that what is on the left is not a non-whitespace char
(?!\S*theo\S*)
Negative lookahead, assert what is on the right does not contain theo
\S*the\S*
Match the, surrounded by 0+ non-whitespace chars on either side
If you are only using word characters, you could also make use of word boundaries \b
\b(?!\w*theo\w*)\w*the\w*\b
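As a quick check, the two patterns above can be run against the sentence from the question. This is only a sketch: Python is an assumption (the question does not name a language), and the variable names are made up for illustration.

```python
import re

# Sentence taken from the question.
sentence = ("If the world says that theo is not oreo cookies then "
            "thetatheoder theotatheder thetatheder is extratheaterly good.")

# Whitespace-delimited variant: the negative lookbehind acts as the start boundary.
whitespace_pattern = re.compile(r"(?<!\S)(?!\S*theo\S*)\S*the\S*", re.IGNORECASE)

# Word-boundary variant: suitable when the words consist of word characters only.
word_pattern = re.compile(r"\b(?!\w*theo\w*)\w*the\w*\b", re.IGNORECASE)

whitespace_matches = whitespace_pattern.findall(sentence)
word_matches = word_pattern.findall(sentence)

print(whitespace_matches)  # ['the', 'then', 'thetatheder', 'extratheaterly']
print(word_matches)        # ['the', 'then', 'thetatheder', 'extratheaterly']
```

The two variants agree on this sentence, but they can differ when punctuation touches a word: for an input such as "then." the \S version would keep the trailing period in the match, while the \w version would stop at "then".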
Or you might assert that part of the word is the, and match it using an assertion that whenever you match a t, it is not followed by heo
\b(?=\S*the\S*)[^t\s]*(?:t(?!heo)[^t\s]*)+\b
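The last pattern can be checked in the same way. Again a sketch under the assumption that Python is acceptable; the names are hypothetical, and the pattern is left case sensitive as written.

```python
import re

# Sentence taken from the question.
sentence = ("If the world says that theo is not oreo cookies then "
            "thetatheoder theotatheder thetatheder is extratheaterly good.")

# The lookahead requires "the" somewhere in the word; after that, every "t"
# that is consumed must not be followed by "heo".
guarded_t_pattern = re.compile(r"\b(?=\S*the\S*)[^t\s]*(?:t(?!heo)[^t\s]*)+\b")

guarded_matches = guarded_t_pattern.findall(sentence)
print(guarded_matches)  # ['the', 'then', 'thetatheder', 'extratheaterly']
```

Words such as thetatheoder are rejected because the match cannot get past the t that starts theo, and stopping earlier would leave the match ending in the middle of a word, where \b fails.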