python regex nltk tokenize

Regex: Only want space character before and after match


I am using NLTK's RegexpTokenizer on a text passage, and I would like to extract only the words that have whitespace immediately before and after them. Here is my code:

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[0-9a-z][^\s']*[a-z]")

For instance, the sentence "we don't have 500 dollars" ends up as "we don have dollars". I would like "don" to be eliminated as well, since it is not followed by whitespace. How do I do that?


Solution

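  • You can use positive lookbehind and lookahead to achieve this: require that each match is preceded by whitespace or the start of the string, and followed by whitespace or the end of the string.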

    Code:

    import re

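    # Lookbehind: start of string or whitespace before the word.
    # Lookahead: whitespace or end of string after the word.
    pattern = r"(?:(?<=^)|(?<=\s))([a-zA-Z0-9]+)(?:(?=\s)|(?=$))"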
    print(re.findall(pattern, "we don't have 500 dollars"))
    print(re.findall(pattern, "Your money's no good here, Mr. Torrance"))
    

    Output:

    ['we', 'have', '500', 'dollars']
    ['Your', 'no', 'good', 'Torrance']
    

    You can experiment with this pattern at https://regex101.com/r/IeLC88/3
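
The same boundary check can probably be written more compactly with negative lookarounds. This is a sketch of an alternative, not part of the original answer: `(?<!\S)` means "not preceded by a non-whitespace character" and `(?!\S)` means "not followed by one", so both also hold at the edges of the string. The function name `whitespace_delimited_words` is made up for illustration.

```python
import re

# Assumed alternative to the pattern above: negative lookarounds.
# (?<!\S) holds after whitespace or at the start of the string;
# (?!\S) holds before whitespace or at the end of the string.
PATTERN = re.compile(r"(?<!\S)[a-zA-Z0-9]+(?!\S)")


def whitespace_delimited_words(text):
    """Return alphanumeric tokens bounded by whitespace or the string edges."""
    return PATTERN.findall(text)


print(whitespace_delimited_words("we don't have 500 dollars"))
print(whitespace_delimited_words("Your money's no good here, Mr. Torrance"))
```

Because this pattern has no capturing group, it should also be usable as the argument to `RegexpTokenizer`, although that has not been checked here.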