python python-re lookbehind

How to match words that are not immediately followed by a keyword, and words that are not surrounded by the keyword?


I am trying to find the words that do not come immediately before the keyword 'the'.

I used a positive look-behind, (?<=the\W), to get the words that come right after 'the'. However, this does not capture 'people' and 'that', because neither of them directly follows 'the'.

In other words, I cannot handle the words that have 'the' neither immediately before nor immediately after them (for example, 'that' and 'people' in the sentence below).

import re

p = re.compile(r'(?<=the\W)\w+')  # words directly preceded by 'the' plus one non-word character
m = p.findall('the part of the fair that attracts the most people is the fireworks')

print(m)

The current output I am getting is:

['part', 'fair', 'most', 'fireworks']
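To make the gap concrete, here is a small sketch that compares the look-behind result with every word in the sentence. The variable names (`sentence`, `after_the`, `skipped`) are only illustrative and are not part of the original code.

```python
import re

sentence = 'the part of the fair that attracts the most people is the fireworks'

# The look-behind accepts a word only when the text right before it
# is 'the' followed by exactly one non-word character.
after_the = re.findall(r'(?<=the\W)\w+', sentence)

# Every word in the sentence, then the ones the look-behind did not return.
words = re.findall(r'\w+', sentence)
skipped = [w for w in words if w not in after_the]

print(after_the)  # ['part', 'fair', 'most', 'fireworks']
print(skipped)    # ['the', 'of', 'the', 'that', 'attracts', 'the', 'people', 'is', 'the']
```

'that' and 'people' end up in `skipped` because neither is directly preceded by 'the', so a look-behind alone cannot select them.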

Edit:

Thank you for all the help below. Using the suggestions in the comments, I managed to update my code.

p = re.compile(r"\b(?!the)(\w+)(\W\w+\Wthe)?")
m = p.findall('the part of the fair that attracts the most people is the fireworks')

This brings me closer to the output I need to get.

Updated Output:

[('part', ' of the'), ('fair', ''),
 ('that', ' attracts the'), ('most', ''),
 ('people', ' is the'), ('fireworks', '')]

I just need the strings ('part', 'fair', 'that', 'most', 'people', 'fireworks'). Any advice?
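One possible workaround, sketched under the assumption that only the first element of each tuple is wanted: `re.findall` returns tuples whenever the pattern contains more than one capturing group, so the first item of each tuple can be picked out afterwards. The names `pairs` and `words` are illustrative.

```python
import re

sentence = 'the part of the fair that attracts the most people is the fireworks'

# Two capturing groups, so findall yields (group 1, group 2) tuples.
pairs = re.findall(r"\b(?!the)(\w+)(\W\w+\Wthe)?", sentence)

# Keep only group 1 from each tuple.
words = [first for first, _ in pairs]

print(words)  # ['part', 'fair', 'that', 'most', 'people', 'fireworks']
```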


Solution

  • I have finally solved the question. Thank you all!

    import re

    p = re.compile(r"\b(?!the)(\w+)(?:\W\w+\Wthe)?")
    m = p.findall('the part of the fair that attracts the most people is the fireworks')
    print(m)

    Made the second group non-capturing by adding '?:'. With only one capturing group left, findall returns plain strings instead of tuples.

    Output:

    ['part', 'fair', 'that', 'most', 'people', 'fireworks']
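A hedged alternative, not taken from the answer above: if the goal is read as "every word that is neither 'the' itself nor directly followed by 'the'", the condition can be written with two negative lookaheads. The pattern below and the helper name `words_not_before_the` are assumptions for illustration. The `\b` after `the` is added so that words such as 'there' are not mistaken for the keyword.

```python
import re

# (?!the\b)      -> the word itself is not exactly 'the'
# (?!\W+the\b)   -> the word is not directly followed by the word 'the'
PATTERN = re.compile(r"\b(?!the\b)\w+\b(?!\W+the\b)")

def words_not_before_the(text):
    """Return words that are not 'the' and do not come right before 'the'."""
    return PATTERN.findall(text)

sentence = 'the part of the fair that attracts the most people is the fireworks'
print(words_not_before_the(sentence))
# ['part', 'fair', 'that', 'most', 'people', 'fireworks']

# Words that merely start with 'the' are kept, and a leading word
# that sits right before 'the' is dropped.
print(words_not_before_the('then the dog ate there'))
# ['dog', 'ate', 'there']
```

Because the pattern has no capturing groups, `findall` returns plain strings directly. The match is case-sensitive; passing `re.IGNORECASE` to `re.compile` would also treat 'The' as the keyword.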