
How to strip punctuation from a string, except apostrophes, for NLP


I am using the following "fastest" way of removing punctuation from a string:

import string
text = file_open.translate(str.maketrans("", "", string.punctuation))

However, it removes all punctuation, including apostrophes, so a token such as shouldn't becomes shouldnt.

The problem is that I am using the NLTK library for stopwords, and the standard stopword list does not include such apostrophe-free forms. Instead, it contains the tokens that NLTK would generate if I used the NLTK tokenizer to split my text. For example, for shouldnt the stopwords included are shouldn, shouldn't, and t.

I could either add the extra stopwords or remove the apostrophes from the NLTK stopwords. Neither solution seems "correct" to me, though, because I think apostrophes should be kept when cleaning punctuation.

Is there a way to keep the apostrophes while still doing fast punctuation cleaning?


Solution

  • >>> from string import punctuation
    >>> type(punctuation)
    <class 'str'>
    >>> my_punctuation = punctuation.replace("'", "")
    >>> my_punctuation
    '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
    >>> "It's right, isn't it?".translate(str.maketrans("", "", my_punctuation))
    "It's right isn't it"
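
The session above can be wrapped into a small reusable helper. This is only a minimal sketch: the names strip_punctuation, PUNCT_TO_REMOVE, TABLE and SAMPLE_STOPWORDS are made up for illustration, and the tiny stopword set is a hypothetical stand-in for the NLTK list so that the example runs without downloading any corpus.

```python
import string

# Every ASCII punctuation character except the apostrophe.
PUNCT_TO_REMOVE = string.punctuation.replace("'", "")

# Build the translation table once and reuse it for every string.
TABLE = str.maketrans("", "", PUNCT_TO_REMOVE)


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation from text, keeping apostrophes."""
    return text.translate(TABLE)


# Hypothetical stand-in for NLTK's English stopwords (illustration only).
SAMPLE_STOPWORDS = {"shouldn't", "it", "you"}

cleaned = strip_punctuation("You shouldn't do it, really!")
tokens = [t for t in cleaned.lower().split() if t not in SAMPLE_STOPWORDS]

print(cleaned)  # You shouldn't do it really
print(tokens)   # ['do', 'really']
```

Because the apostrophe survives, shouldn't still matches its stopword entry. Note that string.punctuation only covers ASCII characters, so typographic apostrophes such as ’ are left untouched by either version and may need to be normalized separately.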