Using regex to find a substring

I am facing a problem with regex usage. I am using the following regex:

\\S*the[^o\\s]*(?<!theo)\\b

The sentence that I am using is:

If the world says that theo is not oreo cookies then thetatheoder theotatheder thetatheder is extratheaterly good.

The output I want is these matches: the, then, thetatheder, extratheaterly.

In short, I want to match 'the' (or 'The'), either as a whole word or as a substring of a word, as long as that word does not contain 'theo'.

How can I modify my regex to achieve this? I considered using alternation (the pipe operator) or the question mark, but neither seems to work.

Solution

You might use \S in a negative lookbehind as the start boundary, and a negative lookahead to make sure the word does not contain theo.

To match both The and the, you could make the pattern case insensitive.

(?<!\S)(?!\S*theo\S*)\S*the\S*

In parts

(?<!\S) Negative lookbehind, assert that what is directly on the left is not a non-whitespace char
(?!\S*theo\S*) Negative lookahead, assert that the run of non-whitespace chars on the right does not contain theo
\S*the\S* Match the, surrounded on both sides by 0+ non-whitespace chars
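
As a quick check, here is a minimal Python sketch that runs this pattern against the sentence from the question. The choice of Python, the names SENTENCE, TOKEN_PATTERN and find_the_tokens, and the second sample string are assumptions made for illustration; the question does not say which regex engine is in use.

```python
import re

# Sample sentence copied from the question.
SENTENCE = ("If the world says that theo is not oreo cookies then "
            "thetatheoder theotatheder thetatheder is extratheaterly good.")

# (?<!\S)          start of a whitespace-delimited token
# (?!\S*theo\S*)   the token must not contain "theo"
# \S*the\S*        the token must contain "the"
# re.IGNORECASE makes the pattern accept "The" as well as "the".
TOKEN_PATTERN = re.compile(r"(?<!\S)(?!\S*theo\S*)\S*the\S*", re.IGNORECASE)

def find_the_tokens(text):
    """Return tokens that contain 'the' but not 'theo' (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)

print(find_the_tokens(SENTENCE))
# ['the', 'then', 'thetatheder', 'extratheaterly']

print(find_the_tokens("The Theory of the theater"))
# ['The', 'the', 'theater']
```

Note that with the case-insensitive flag, the exclusion applies to Theo as well, which is why Theory is dropped in the second call.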

If the words consist only of word characters, you could also make use of word boundaries \b:

\b(?!\w*theo\w*)\w*the\w*\b
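
A small Python sketch of the word-boundary variant follows. The names WORD_PATTERN and find_the_words and the punctuated sample string are hypothetical, chosen only to show how \w and \b behave around punctuation.

```python
import re

# \b ... \b        anchor the match at word boundaries
# (?!\w*theo\w*)   the word must not contain "theo"
# \w*the\w*        the word must contain "the"
WORD_PATTERN = re.compile(r"\b(?!\w*theo\w*)\w*the\w*\b", re.IGNORECASE)

def find_the_words(text):
    """Return words that contain 'the' but not 'theo' (hypothetical helper)."""
    return WORD_PATTERN.findall(text)

# Punctuation is not a word character, so it is left out of the match.
print(find_the_words("Well then, theo ate the."))
# ['then', 'the']
```

On the same input, the \S-based pattern would return the tokens with their trailing punctuation attached (then, and the.), so which variant fits depends on whether punctuation should count as part of the word.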

Or you might assert with a lookahead that the word contains the, and then match the word while making sure that every t you match is not followed by heo:

\b(?=\S*the\S*)[^t\s]*(?:t(?!heo)[^t\s]*)+\b
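
The last pattern can be tried out the same way. This is only a sketch in Python; the names SENTENCE and TEMPERED_PATTERN are assumptions, and the case-insensitive flag is left off here to keep the example minimal.

```python
import re

# Sample sentence copied from the question.
SENTENCE = ("If the world says that theo is not oreo cookies then "
            "thetatheoder theotatheder thetatheder is extratheaterly good.")

# \b(?=\S*the\S*)        at a word boundary, require "the" somewhere ahead
# [^t\s]*                consume characters up to the first t
# (?:t(?!heo)[^t\s]*)+   consume each t only if it does not start "theo"
# \b                     the match must end at a word boundary
TEMPERED_PATTERN = re.compile(r"\b(?=\S*the\S*)[^t\s]*(?:t(?!heo)[^t\s]*)+\b")

matches = TEMPERED_PATTERN.findall(SENTENCE)
print(matches)
# ['the', 'then', 'thetatheder', 'extratheaterly']
```

A word such as thetatheoder is rejected because the repeated group stops at the t that begins theo, and the closing \b cannot be satisfied in the middle of the word.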