I'm trying to calculate the word frequency for a messaging dataframe using TF-IDF. So far I have this:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
new_group['tokenized_sents'] = new_group.apply(lambda row: nltk.word_tokenize(row['message']), axis=1).astype(str).str.lower()
vectoriser=TfidfVectorizer()
new_group['tokenized_vector'] = list(vectoriser.fit_transform(new_group['tokenized_sents']).toarray())
However, with the code above I get a bunch of zeros instead of the word frequencies. How can I fix this to get the correct word frequencies for the messages? This is my dataframe:
user_id    date        message     tokenized_sents      tokenized_vector
X35WQ0U8S  2019-02-17  Need help   ['need','help']      [0.0,0.0]
X36WDMT2J  2019-03-22  Thank you!  ['thank','you','!']  [0.0,0.0,0.0]
First, for raw counts you don't want TfidfVectorizer, because it returns normalized TF-IDF weights rather than counts. Use CountVectorizer instead. Second, you don't need to tokenize the words yourself, as sklearn has a built-in tokenizer in both TfidfVectorizer and CountVectorizer.
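from sklearn.feature_extraction.text import CountVectorizer
# add whatever settings you want
countVec = CountVectorizer()
# fit and transform; df is the dataframe holding the messages (new_group in the question)
cv = countVec.fit_transform(df['message'].str.lower())
# feature names (use get_feature_names() on scikit-learn older than 1.0)
cv_feature_names = countVec.get_feature_names_out()
# total count of each feature across all messages
feature_count = cv.toarray().sum(axis=0)
# feature name to count
dict(zip(cv_feature_names, feature_count))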