Tags: python-3.x, pandas, word-frequency, tfidfvectorizer

word frequency with TfidfVectorizer


I'm trying to calculate the word frequency for a messaging dataframe using TF-IDF. So far I have this

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

new_group['tokenized_sents'] = new_group.apply(lambda row: nltk.word_tokenize(row['message']), axis=1).astype(str).str.lower()
vectoriser = TfidfVectorizer()
new_group['tokenized_vector'] = list(vectoriser.fit_transform(new_group['tokenized_sents']).toarray())

However, with the code above I get a bunch of zeros instead of the word frequencies. How can I fix this to get the correct frequency counts for the messages? This is my dataframe:

user_id     date          message      tokenized_sents      tokenized_vector
X35WQ0U8S   2019-02-17    Need help    ['need','help']      [0.0,0.0]
X36WDMT2J   2019-03-22    Thank you!   ['thank','you','!']  [0.0,0.0,0.0]

Solution

  • First of all, for counts you don't want to use TfidfVectorizer, because its output is normalized; you want CountVectorizer. Second, you don't need to tokenize the words yourself, as sklearn has a built-in tokenizer in both TfidfVectorizer and CountVectorizer. A per-message variant is sketched after the snippet below.

    #CountVectorizer lives in the same module as TfidfVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    
    #add whatever settings you want
    countVec = CountVectorizer()
    
    #fit transform on the raw messages
    cv = countVec.fit_transform(new_group['message'].str.lower())
    
    #feature names (use get_feature_names_out() on scikit-learn >= 1.0)
    cv_feature_names = countVec.get_feature_names()
    
    #feature counts, summed over all messages
    feature_count = cv.toarray().sum(axis=0)
    
    #feature name to count
    dict(zip(cv_feature_names, feature_count))
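
  • The snippet above gives one total per word across the whole corpus (for the two sample messages it comes out to {'help': 1, 'need': 1, 'thank': 1, 'you': 1}; the '!' is dropped by the default token pattern). If you want the counts, or the TF-IDF weights, per message instead, something along the lines of the sketch below should work. It assumes the new_group dataframe from the question and scikit-learn >= 1.0; on older versions replace get_feature_names_out() with get_feature_names().

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    
    #the sample data from the question, just to make the sketch self-contained
    new_group = pd.DataFrame({
        'user_id': ['X35WQ0U8S', 'X36WDMT2J'],
        'date': ['2019-02-17', '2019-03-22'],
        'message': ['Need help', 'Thank you!'],
    })
    
    #raw counts per message: one row per message, one column per vocabulary
    #term (lowercasing is done by the vectorizer itself by default)
    count_vec = CountVectorizer()
    counts = count_vec.fit_transform(new_group['message'])
    count_df = pd.DataFrame(counts.toarray(),
                            columns=count_vec.get_feature_names_out(),
                            index=new_group.index)
    
    #TF-IDF weights per message, if the normalized scores are what you want
    tfidf_vec = TfidfVectorizer()
    tfidf = tfidf_vec.fit_transform(new_group['message'])
    tfidf_df = pd.DataFrame(tfidf.toarray(),
                            columns=tfidf_vec.get_feature_names_out(),
                            index=new_group.index)
    
    #join the per-message counts back onto the original dataframe
    result = new_group.join(count_df)
    print(result)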