I am trying to use regex in Python to capture whole words from text. This is simple enough, but I also want to remove contractions and possessives indicated by apostrophes.
Currently I have (?iu)(?<!')(?!n')[\w]+
Testing on the following text
One tree or many trees? My tree's green. I didn't figure this out yet.
Gives these matches
One tree or many trees My tree green I didn figure this out yet
In this example the negative lookbehind prevents the "s" and "t" after an apostrophe from being matched as whole words. But how do I write the negative lookahead (?!n') so that the matches include "did" instead of "didn"?
(My use case here is a simple Python spell checker, where each word gets validated as being spelt correctly or not. I've ended up using the autocorrect module, as pyenchant, aspell-python and others didn't work when installed via pip.)
I would use this regex:
(?<![\w'])\w+?(?=\b|n't)
This matches word characters lazily, stopping as soon as it reaches a word boundary or the text n't.
Result:
>>> re.findall(r"(?<![\w'])\w+?(?=\b|n't)", "One tree or many trees? My tree's green. I didn't figure this out yet.")
['One', 'tree', 'or', 'many', 'trees', 'My', 'tree', 'green', 'I', 'did', 'figure', 'this', 'out', 'yet']
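To see why the lookahead in the question has no effect, it may help to run both patterns side by side. A lookahead is only tested at the position where it sits, so (?!n') is checked once at the start of each match attempt, and the greedy [\w]+ then consumes "didn" anyway. The variable names below are just for illustration:

```python
import re

text = "One tree or many trees? My tree's green. I didn't figure this out yet."

# Pattern from the question: (?!n') is only checked where a match starts,
# so the greedy [\w]+ still runs on through the "n" of "didn't".
question_pattern = re.compile(r"(?iu)(?<!')(?!n')[\w]+")

# Pattern from this answer: the lazy \w+? stops at the first position
# that is followed by a word boundary or by "n't".
answer_pattern = re.compile(r"(?<![\w'])\w+?(?=\b|n't)")

question_words = question_pattern.findall(text)
answer_words = answer_pattern.findall(text)

print(question_words)  # contains 'didn'
print(answer_words)    # contains 'did'
```

The [\w'] in the lookbehind also matters: without \w there, a second match could start at the "n" left over after "did".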
Breakdown:
(?<! # negative lookbehind: assert the text is not preceded by...
[\w'] # ... a word character or apostrophe
)
\w+? # match word characters, as few as necessary, until...
(?=
\b # ... a word boundary...
| # ... or ...
n't # ... the text "n't"
)
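If you want to keep that breakdown in your code, one option is to compile the pattern with re.VERBOSE so the comments live inside the regex. This is only a sketch; WORD_RE and extract_words are names I made up for the example:

```python
import re

# Same pattern as above, written with re.VERBOSE so that whitespace is
# ignored and everything after '#' is treated as a comment.
WORD_RE = re.compile(
    r"""
    (?<![\w'])   # not preceded by a word character or an apostrophe
    \w+?         # word characters, as few as necessary, until...
    (?=
        \b       # ... a word boundary
      | n't      # ... or the text n't
    )
    """,
    re.VERBOSE,
)


def extract_words(text):
    """Return the words in text, without possessive and n't suffixes."""
    return WORD_RE.findall(text)


words = extract_words("My tree's green. I didn't figure this out yet.")
print(words)
```

One thing to watch for in a spell checker: the pattern strips n't mechanically, so "can't" comes out as "ca" and "won't" as "wo". You would probably need to special-case those irregular contractions before validating the words.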