python, regex, machine-learning, nlp, artificial-intelligence

Using regex to find a substring


I am facing a problem with regex usage. I am using the following regex:

\\S*the[^o\\s]*(?<!theo)\\b

The sentence that I am using is:

If the world says that theo is not oreo cookies then thetatheoder theotatheder thetatheder is extratheaterly good.

What I want from the output are the matches: the, then, thetatheder, extratheaterly.

In short, I want to match 'the' (or 'The'), either as a complete word or as a substring of a word, as long as that word does not contain 'theo'.

How can I modify my regex to achieve this? I thought about using alternation (the pipe operator) or a question mark quantifier, but neither seems to work.


Solution

  • You could use \S in a negative lookbehind as the start boundary, and a negative lookahead to make sure the word does not contain theo.

    To match both The and the, you could make the pattern case insensitive.

    (?<!\S)(?!\S*theo\S*)\S*the\S*
    

    In parts

    • (?<!\S) Negative lookbehind, assert that what is directly to the left is not a non-whitespace char
    • (?!\S*theo\S*) Negative lookahead, assert that the run of non-whitespace chars to the right does not contain theo
    • \S*the\S* Match the, surrounded on both sides by 0+ non-whitespace chars
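
As a quick check, here is a minimal sketch that runs this pattern on the sentence from the question using Python's re module. The variable names and the second, mixed-case sample string are only illustrative assumptions, not part of the original question.

```python
import re

# Sentence taken from the question.
sentence = (
    "If the world says that theo is not oreo cookies then thetatheoder "
    "theotatheder thetatheder is extratheaterly good."
)

# Whitespace-delimited token that contains "the" but not "theo".
# re.IGNORECASE makes "The" and "the" equivalent.
pattern = re.compile(r"(?<!\S)(?!\S*theo\S*)\S*the\S*", re.IGNORECASE)

matches = pattern.findall(sentence)
print(matches)  # ['the', 'then', 'thetatheder', 'extratheaterly']

# Illustrative mixed-case input (an assumption, not from the question).
mixed_case = pattern.findall("The theory of the Theatre")
print(mixed_case)  # ['The', 'the', 'Theatre']
```

Because the pattern is built on \S, any punctuation attached to a word becomes part of the match, which may or may not be what you want.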


    If the words consist only of word characters, you could also make use of word boundaries \b

    \b(?!\w*theo\w*)\w*the\w*\b
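
To illustrate how the two variants differ, the following sketch compares the \S-based and the \b-based pattern on a short sample with punctuation. The sample text and variable names are assumptions made up for this illustration.

```python
import re

# \S-based variant: tokens are delimited by whitespace.
token_pattern = re.compile(r"(?<!\S)(?!\S*theo\S*)\S*the\S*", re.IGNORECASE)
# \b-based variant: tokens are runs of word characters.
word_pattern = re.compile(r"\b(?!\w*theo\w*)\w*the\w*\b", re.IGNORECASE)

# Hypothetical sample with punctuation attached to the words.
text = "Then the, theory; other."

token_matches = token_pattern.findall(text)
word_matches = word_pattern.findall(text)

print(token_matches)  # ['Then', 'the,', 'other.']
print(word_matches)   # ['Then', 'the', 'other']

# On the sentence from the question both variants agree.
sentence = (
    "If the world says that theo is not oreo cookies then thetatheoder "
    "theotatheder thetatheder is extratheaterly good."
)
question_matches = word_pattern.findall(sentence)
print(question_matches)  # ['the', 'then', 'thetatheder', 'extratheaterly']
```

The \b version leaves out the trailing comma and period, so it is probably the better fit when the text contains ordinary punctuation.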
    


    Or you might assert with a lookahead that the word contains the, and then match the word in such a way that every t is not directly followed by heo

    \b(?=\S*the\S*)[^t\s]*(?:t(?!heo)[^t\s]*)+\b
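
A small sketch, again assuming Python's re module and illustrative variable names, that checks this variant against the first lookahead-based pattern on the sentence from the question:

```python
import re

sentence = (
    "If the world says that theo is not oreo cookies then thetatheoder "
    "theotatheder thetatheder is extratheaterly good."
)

# Variant that rejects "theo" up front with a negative lookahead.
lookahead_pattern = re.compile(r"(?<!\S)(?!\S*theo\S*)\S*the\S*")

# Variant that requires "the" somewhere in the word, then consumes the word
# piece by piece, allowing a "t" only when it is not followed by "heo".
stepwise_pattern = re.compile(r"\b(?=\S*the\S*)[^t\s]*(?:t(?!heo)[^t\s]*)+\b")

lookahead_matches = lookahead_pattern.findall(sentence)
stepwise_matches = stepwise_pattern.findall(sentence)

print(stepwise_matches)  # ['the', 'then', 'thetatheder', 'extratheaterly']
print(lookahead_matches == stepwise_matches)  # True
```

Both patterns give the same result on this input; which one reads better is largely a matter of taste.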
    
