I am using NLTK's RegexpTokenizer on a text passage, and I would like to extract only the words that have whitespace immediately before and after them. Here is my code:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[0-9a-z][^\s']*[a-z]")
For instance, the sentence "we don't have 500 dollars" becomes "we don have dollars". I would like "don" to be eliminated as well, since it is not followed by whitespace. How do I do that?
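You can use a positive lookbehind and lookahead to achieve this: require start-of-string or whitespace before the word, and whitespace or end-of-string after it.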
Code:
import re
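# Match a run of letters/digits that is preceded by start-of-string or whitespace
# and followed by whitespace or end-of-string.
pattern = r"(?:(?<=^)|(?<=\s))([a-zA-Z0-9]+)(?:(?=\s)|(?=$))"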
print(re.findall(pattern, "we don't have 500 dollars"))
print(re.findall(pattern, "Your money's no good here, Mr. Torrance"))
Output:
['we', 'have', '500', 'dollars']
['Your', 'no', 'good', 'Torrance']
You can play around with this here: https://regex101.com/r/IeLC88/3
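As a side note, the same condition can probably be written more compactly with negative lookarounds: "not preceded by a non-whitespace character" covers both start-of-string and whitespace, and likewise for the end. This is only a sketch of an alternative and not part of the pattern above; the names WORD and whitespace_delimited_words are made up for illustration.

```python
import re

# Assumed equivalent of the lookbehind/lookahead pattern above:
# (?<!\S) = not preceded by a non-whitespace character (start or whitespace)
# (?!\S)  = not followed by a non-whitespace character (whitespace or end)
WORD = re.compile(r"(?<!\S)[a-zA-Z0-9]+(?!\S)")

def whitespace_delimited_words(text):
    """Return alphanumeric tokens bounded only by whitespace or the string edges."""
    return WORD.findall(text)

print(whitespace_delimited_words("we don't have 500 dollars"))
print(whitespace_delimited_words("Your money's no good here, Mr. Torrance"))
```

On the two sample sentences this should give the same lists as the output shown above. Since the pattern has no capturing group, it should also be usable as the pattern passed to RegexpTokenizer, though that is not tested here.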