Tags: python, regex, spell-checking

Python regex to match whole words (minus contractions and possessives)


I am trying to use a regex in Python to capture whole words from text. This is simple enough, but I also want to strip the contraction and possessive suffixes marked by apostrophes (such as 's and n't).

Currently I have (?iu)(?<!')(?!n')[\w]+

Testing it on the following text:

One tree or many trees? My tree's green. I didn't figure this out yet.

gives these matches:

One tree or many trees My tree green I didn figure this out yet

In this example the negative lookbehind prevents the "s" and "t" after an apostrophe from being matched as whole words. But how do I write the negative lookahead (?!n') so that the matches include "did" instead of "didn"?
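
For reference, here is a minimal script that reproduces the matches above (the variable names are only illustrative). My understanding is that the lookahead (?!n') is tested only at the position where a match starts, so it never gets a chance to trim the trailing "n" from "didn":

```python
import re

text = "One tree or many trees? My tree's green. I didn't figure this out yet."

# The pattern from the question, unchanged. (?!n') is checked once, at the
# start of each candidate match; the greedy [\w]+ then runs through the "n".
attempt = re.compile(r"(?iu)(?<!')(?!n')[\w]+")

matches = attempt.findall(text)
print(" ".join(matches))
# One tree or many trees My tree green I didn figure this out yet
```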

(My use case is a simple Python spell checker in which each word is validated as being spelt correctly or not. I ended up using the autocorrect module, as pyenchant, aspell-python and others didn't work when installed via pip.)


Solution

  • I would use this regex:

    (?<![\w'])\w+?(?=\b|n't)
    

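    The lazy \w+? matches as few word characters as it can and stops at the first position that is followed by either a word boundary or the text n't. The lookbehind keeps a match from starting in the middle of a word or right after an apostrophe, which is what excludes the leftover "n", "t" and "s".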

    Result:

    >>> import re
    >>> re.findall(r"(?<![\w'])\w+?(?=\b|n't)", "One tree or many trees? My tree's green. I didn't figure this out yet.")
    ['One', 'tree', 'or', 'many', 'trees', 'My', 'tree', 'green', 'I', 'did', 'figure', 'this', 'out', 'yet']
    

    Breakdown:

    (?<!         # negative lookbehind: assert the text is not preceded by...
        [\w']    # ... a word character or apostrophe
    )
    \w+?         # match word characters, as few as necessary, until...
    (?=
        \b       # ... a word boundary...
    |            # ... or ...
        n't      # ... the text "n't"
    )
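
If you want to reuse the pattern, the breakdown above can be compiled as-is with re.VERBOSE. This is only a sketch: the names WORD and base_words are made up for illustration, and the second call shows a limitation worth knowing about for a spell checker.

```python
import re

# Same regex as above, written in verbose mode so the comments can stay inline.
WORD = re.compile(
    r"""
    (?<![\w'])   # not preceded by a word character or an apostrophe
    \w+?         # as few word characters as possible, until...
    (?=\b|n't)   # ...a word boundary or the text "n't"
    """,
    re.VERBOSE,
)


def base_words(text):
    """Return the words in text with 's and n't suffixes stripped (hypothetical helper)."""
    return WORD.findall(text)


print(base_words("I didn't see the dog's bowl."))
# ['I', 'did', 'see', 'the', 'dog', 'bowl']

# Irregular contractions lose their "n't" too, leaving stems that are not words.
print(base_words("I can't and won't."))
# ['I', 'ca', 'and', 'wo']
```

Stems like "ca" and "wo" would probably be flagged as misspellings, so you may want to special-case those contractions before validating.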