python nltk tokenize

How to remove ' in strings with RegexpTokenizer


from nltk.tokenize import RegexpTokenizer
text="That's some text, you know!"
tokens=[]
tokenizer = RegexpTokenizer(r'\w+')
tokens+=tokenizer.tokenize(text.lower())

Currently returns: tokens = ['that', 's', 'some', 'text', 'you', 'know']

I need it to return: tokens = ['thats', 'some', 'text', 'you', 'know'] (with "thats" as one word)


Solution

  • There are two options. Either strip the apostrophes from your text variable before tokenizing:

    text = text.replace("'", "")
    

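    or allow apostrophes inside a token, so that "that's" is matched as a single word. Note that the token then keeps its apostrophe ("that's", not "thats"), so you would still need to strip it afterwards if you want "thats":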

    tokenizer = RegexpTokenizer(r'[\w\']+')
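
To compare the two options side by side, here is a minimal sketch using only the standard library. It assumes that RegexpTokenizer in its default (non-gap) mode returns the same matches as re.findall with the same pattern, so re.findall stands in for tokenizer.tokenize here. The variable names are illustrative and not from the question.

```python
import re

text = "That's some text, you know!"

# Option 1: remove apostrophes first, then match runs of word characters.
stripped_first = re.findall(r"\w+", text.lower().replace("'", ""))

# Option 2: allow apostrophes inside a token (r"[\w']+" is equivalent to
# the r'[\w\']+' pattern above). The apostrophe stays in the token.
kept_apostrophe = re.findall(r"[\w']+", text.lower())

# Option 2 plus cleanup: strip the apostrophes from each token afterwards.
stripped_after = [token.replace("'", "") for token in kept_apostrophe]

print(stripped_first)   # ['thats', 'some', 'text', 'you', 'know']
print(kept_apostrophe)  # ["that's", 'some', 'text', 'you', 'know']
print(stripped_after)   # ['thats', 'some', 'text', 'you', 'know']
```

Only the first and third variants produce "thats" as asked in the question. The second is useful when you want to keep contractions intact as tokens.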