Tags: python, python-3.x, string, counter, tokenize

remove stopwords/punctuation, tokenize and apply Counter()


I have a function written to remove stopwords and tokenize as follows:

from nltk.tokenize import TweetTokenizer

def process(text, tokenizer=TweetTokenizer(), stopwords=[]):
    # lowercase, tokenize, then drop stopwords and pure-digit tokens
    text = text.lower()
    tokens = tokenizer.tokenize(text)
    return [tok for tok in tokens if tok not in stopwords and not tok.isdigit()]
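
For example, on a single made-up tweet (the sample text is mine) it gives:

print(process("RT Black Lives Matter! #blm 2020", stopwords=['rt']))
# ['black', 'lives', 'matter', '!', '#blm'] -- 'rt' and the digits are dropped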

I am applying it to a column tweet['cleaned_text'] as follows:

import string
from collections import Counter
from nltk.corpus import stopwords

punct = list(string.punctuation)
stopword_list = stopwords.words('english') + punct + ['rt', 'via', '...', '“', '”', '’']

tf = Counter()
for i in list(tweet['cleaned_text']):
    temp = process(i, tokenizer=TweetTokenizer(), stopwords=stopword_list)
    tf.update(temp)
for tag, count in tf.most_common(20):
    print("{}: {}".format(tag, count))

The output should be the most common words. Here they are:

#blm: 12718
black: 2751
#blacklivesmatter: 2054
people: 1375
lives: 1255
matter: 1039
white: 914
like: 751
police: 676
get: 564
movement: 563
support: 534
one: 534
racist: 532
know: 520
us: 471
blm: 449
#antifa: 414
hate: 396
see: 382

As you can see, I am not able to get rid of the hashtag # even though it is included in the punctuation list (some stopwords are apparent too). #blm and blm are double-counted when they should be the same.
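
Tokenizing a sample string directly (the text here is just an example of mine) shows the hashtag staying attached to the word, so "#blm" never matches the standalone "#" in stopword_list:

print(TweetTokenizer().tokenize("#blm lives matter"))
# ['#blm', 'lives', 'matter'] -- the hashtag is part of the token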

I must be missing something in the code.


Solution

  • When you process tokens you keep the entire word; if you want to strip out a leading #, you can use str.strip("#"):

    def process(text, tokenizer=TweetTokenizer(), stopwords=[]):
        text = text.lower()
        tokens = tokenizer.tokenize(text)
        # strip "#" so "#blm" and "blm" count as the same token
        return [tok.strip("#") for tok in tokens if tok not in stopwords and not tok.isdigit()]
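
With that change, "#blm" and "blm" fall under the same key when the counting loop is rerun; a quick sanity check (the sample strings are mine):

    tf = Counter()
    tf.update(process("#blm rally", tokenizer=TweetTokenizer(), stopwords=stopword_list))
    tf.update(process("blm rally", tokenizer=TweetTokenizer(), stopwords=stopword_list))
    print(tf)  # Counter({'blm': 2, 'rally': 2})

Note that str.strip("#") removes "#" from both ends of a token; use str.lstrip("#") if you only want the leading one. Also, because the strip happens after the stopword check, hashtag forms of stopwords (e.g. "#rt" becomes "rt") still slip through; strip before filtering if you want those removed as well.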