I have a function written to remove stopwords and tokenize as follows:
from nltk.tokenize import TweetTokenizer

def process(text, tokenizer=TweetTokenizer(), stopwords=[]):
    text = text.lower()
    tokens = tokenizer.tokenize(text)
    # keep tokens that are neither stopwords nor pure digits
    return [tok for tok in tokens if tok not in stopwords and not tok.isdigit()]
I am applying it to the tweet['cleaned_text'] column as follows:
import string
from collections import Counter
from nltk.corpus import stopwords

punct = list(string.punctuation)
stopword_list = stopwords.words('english') + punct + ['rt', 'via', '...', '“', '”', '’']

tf = Counter()
for i in tweet['cleaned_text']:
    temp = process(i, tokenizer=TweetTokenizer(), stopwords=stopword_list)
    tf.update(temp)

for tag, count in tf.most_common(20):
    print("{}: {}".format(tag, count))
The output should be the 20 most common words. Here they are:
#blm: 12718
black: 2751
#blacklivesmatter: 2054
people: 1375
lives: 1255
matter: 1039
white: 914
like: 751
police: 676
get: 564
movement: 563
support: 534
one: 534
racist: 532
know: 520
us: 471
blm: 449
#antifa: 414
hate: 396
see: 382
As you can see, I am not able to get rid of the hashtag character # even though it is included in the punctuation list (a few stopwords slipped through as well). #blm and blm are counted separately when they should be the same word. I must be missing something in the code.
When you process tokens you are keeping the entire word: TweetTokenizer emits a hashtag such as #blm as a single token, so it never equals the lone '#' entry in your stopword list. If you want to strip out a leading #, you can use str.lstrip("#").
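A minimal check (assuming NLTK is installed) makes the tokenizer's behaviour visible:

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
print(tokenizer.tokenize("#blm lives matter!"))
# ['#blm', 'lives', 'matter', '!'] -- the hash stays attached to the word

The stopword comparison therefore sees '#blm', never a bare '#'. With the strip added, the function becomes: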
def process(text, tokenizer=TweetTokenizer(), stopwords=[]):
    text = text.lower()
    tokens = tokenizer.tokenize(text)
    # lstrip removes only leading '#' characters, so '#blm' and 'blm' count as one word
    return [tok.lstrip("#") for tok in tokens if tok not in stopwords and not tok.isdigit()]
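One caveat with stripping after the filter: a token like '#rt' passes the stopword check first and only then becomes 'rt', so it slips into the counts. A variant (a sketch under the same assumptions) that strips before filtering, so the stopword list catches those too:

def process(text, tokenizer=TweetTokenizer(), stopwords=[]):
    text = text.lower()
    # strip the leading '#' before filtering so 'rt' from '#rt' is caught by the stopword list
    tokens = [tok.lstrip("#") for tok in tokenizer.tokenize(text)]
    # the bare 'tok' test drops tokens that were nothing but '#'
    return [tok for tok in tokens if tok and tok not in stopwords and not tok.isdigit()]

Either version merges #blm and blm into a single count.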