I am trying to find the words that do not come immediately before the word 'the'.
I used a positive look-behind, (?<=the\W), to get the words that come after the keyword 'the'. However, this does not capture 'people' and 'that', because neither word has 'the' directly before or after it, so the look-behind never applies to them.
import re
p = re.compile(r'(?<=the\W)\w+')
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
The output I am currently getting is
['part', 'fair', 'most', 'fireworks']
Edit:
Thank you for all the help below. Using the suggestions in the comments, I managed to update my code.
p = re.compile(r"\b(?!the)(\w+)(\W\w+\Wthe)?")
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
This brings me closer to the output I need to get.
Updated Output:
[('part', ' of the'), ('fair', ''),
('that', ' attracts the'), ('most', ''),
('people', ' is the'), ('fireworks', '')]
I just need the strings ('part','fair','that','most','people','fireworks'). Any advice?
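One way to get only those strings while keeping the two-group pattern unchanged is sketched below. When a pattern has more than one capturing group, re.findall returns one tuple per match, so the first item of each tuple can be picked out with a list comprehension. The names `sentence`, `matches`, and `words` are only illustrative.

```python
import re

# Same two-group pattern as above: group 1 is the wanted word,
# group 2 is the optional " <word> the" tail that follows it.
p = re.compile(r"\b(?!the)(\w+)(\W\w+\Wthe)?")
sentence = 'the part of the fair that attracts the most people is the fireworks'

# findall yields a (group1, group2) tuple for every match.
matches = p.findall(sentence)

# Keep only the first group of each tuple.
words = [first for first, _ in matches]
print(words)  # ['part', 'fair', 'that', 'most', 'people', 'fireworks']
```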
I have finally solved the question. Thank you all!
p = re.compile(r"\b(?!the)(\w+)(?:\W\w+\Wthe)?")
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
I made the trailing optional group non-capturing by adding '?:', so that findall returns only the one remaining captured group.
Output:
['part', 'fair', 'that', 'most', 'people', 'fireworks']
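A possible alternative, sketched under the assumption that the goal is "every word that is not 'the' itself and is not directly followed by 'the'", is to state both conditions as negative lookaheads. This is not the approach from the comments, just another way to express the same filter. Note that `(?!the)` without a trailing `\b` also rejects words that merely start with "the", such as "there", so the sketch uses `(?!the\b)` instead. The names `pattern` and `sentence` are illustrative.

```python
import re

# (?!the\b)      : the word itself must not be exactly "the"
# \w+\b          : take the whole word
# (?!\W+the\b)   : the next word must not be "the"
pattern = re.compile(r"\b(?!the\b)\w+\b(?!\W+the\b)")
sentence = 'the part of the fair that attracts the most people is the fireworks'

result = pattern.findall(sentence)
print(result)  # ['part', 'fair', 'that', 'most', 'people', 'fireworks']
```

Because the pattern has no capturing groups, findall returns the matched words directly as a flat list.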