Tags: python, nltk, tokenize

How to remove punctuation and numbers during TweetTokenizer step in NLP?


I am relatively new to NLP, so please be gentle. I have a complete list of the text of Trump's tweets since he took office, and I am tokenizing the text to analyze the content.

I am using the TweetTokenizer from the nltk library in Python, and I'm trying to get everything tokenized except for numbers and punctuation. The problem is that my code removes all the tokens except one.

I have tried using the .isalpha() method, which I thought would work since it should only return True for strings composed entirely of alphabetic characters, but it did not.

# Create a list of the tweet text
text = non_re['text']
# Make all text lowercase
low_txt = [l.lower() for l in text]

# Iteratively tokenize the tweets
TokTweet = TweetTokenizer()
tokens = [TokTweet.tokenize(t) for t in low_txt
          if t.isalpha()]

My output from this is just one token. If I remove the if t.isalpha() condition, then I get all of the tokens, including numbers and punctuation, which suggests that isalpha() is to blame for the over-trimming.
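Testing str.isalpha() on whole strings seems to confirm this: it only returns True when every single character in the string is a letter, so almost any full tweet fails the check and gets filtered out before tokenization:

```python
# str.isalpha() checks EVERY character in the string
print("hello".isalpha())        # True
print("hello world".isalpha())  # False - contains a space
print("it's 10pm".isalpha())    # False - apostrophe, space, and digits
```

So only tweets consisting of a single unbroken word survive the comprehension, which would explain why nearly everything is dropped.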

What I would like is a way to get the tokens from the tweet text without punctuation and numbers. Thanks for your help!


Solution

  • Try something like below:

    import string
    import re
    import nltk
    from nltk.tokenize import TweetTokenizer
    
    tweet = "first think another Disney movie, might good, it's kids movie. watch it, can't help enjoy it. ages love movie. first saw movie 10 8 years later still love it! Danny Glover superb could play"
    
    def clean_text(text):
        # remove numbers
        text_nonum = re.sub(r'\d+', '', text)
        # remove punctuation and convert characters to lower case
        text_nopunct = "".join([char.lower() for char in text_nonum if char not in string.punctuation])
        # collapse runs of whitespace into a single space
        # and strip leading/trailing whitespace
        text_no_doublespace = re.sub(r'\s+', ' ', text_nopunct).strip()
        return text_no_doublespace
    
    cleaned_tweet = clean_text(tweet)
    tt = TweetTokenizer()
    print(tt.tokenize(cleaned_tweet))
    

    output:

    ['first', 'think', 'another', 'disney', 'movie', 'might', 'good', 'its', 'kids', 'movie', 'watch', 'it', 'cant', 'help', 'enjoy', 'it', 'ages', 'love', 'movie', 'first', 'saw', 'movie', 'years', 'later', 'still', 'love', 'it', 'danny', 'glover', 'superb', 'could', 'play']
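Alternatively, since TweetTokenizer already splits punctuation off into separate tokens, you can keep .isalpha() but apply it per token instead of per tweet, i.e. tokenize first and filter afterwards. A minimal sketch:

```python
from nltk.tokenize import TweetTokenizer

tweet = "first saw movie 10 8 years later, still love it!"

tt = TweetTokenizer()
# Tokenize first, then keep only purely alphabetic tokens;
# this drops number tokens like "10" and punctuation tokens like "," and "!"
tokens = [tok.lower() for tok in tt.tokenize(tweet) if tok.isalpha()]
print(tokens)
# ['first', 'saw', 'movie', 'years', 'later', 'still', 'love', 'it']
```

Note that this also drops tokens containing apostrophes, such as "can't", so if you want to keep contractions you would need a looser filter than .isalpha().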