Tags: python, pandas, dataframe, text-mining, tf-idf

Data frame of TF-IDF values with Python


I have to classify some sentiments. My data frame looks like this:

Phrase                      Sentiment    
is it  good movie          positive    
wooow is it very goode      positive    
bad movie                  negative

I did some preprocessing (tokenisation, stop-word removal, stemming, etc.) and got:

Phrase                      Sentiment    
[ good , movie  ]        positive    
[wooow ,is , it ,very, good  ]   positive 
[bad , movie ]            negative

Finally, I need a dataframe in which each row is a text, each column is a word, and each value is the word's TF-IDF score in that text, like this:

good     movie    wooow    very     bad      Sentiment
tf_idf   tf_idf   tf_idf   tf_idf   tf_idf   positive
(same for the 2 remaining rows)

Solution

  • I'd use sklearn.feature_extraction.text.TfidfVectorizer, which is specifically designed for such tasks:

    Demo:

    In [63]: df
    Out[63]:
                       Phrase Sentiment
    0       is it  good movie  positive
    1  wooow is it very goode  positive
    2               bad movie  negative
    

    Solution:

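    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
    
    # pop() removes the raw text column from df while it is being vectorized
    X = vect.fit_transform(df.pop('Phrase')).toarray()
    
    # keep the labels, then reuse the name df for the TF-IDF values
    r = df[['Sentiment']].copy()
    
    del df
    
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    df = pd.DataFrame(X, columns=vect.get_feature_names_out())
    
    del X
    del vect
    
    r.join(df)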
    

    Result:

    In [31]: r.join(df)
    Out[31]:
      Sentiment  bad  good     goode     wooow
    0  positive  0.0   1.0  0.000000  0.000000
    1  positive  0.0   0.0  0.707107  0.707107
    2  negative  1.0   0.0  0.000000  0.000000
    

    UPDATE: memory saving solution:

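    from sklearn.feature_extraction.text import TfidfVectorizer
    
    vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
    
    X = vect.fit_transform(df.pop('Phrase')).toarray()
    
    # add one column per word to the existing frame instead of building a second one
    # (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
    for i, col in enumerate(vect.get_feature_names_out()):
        df[col] = X[:, i]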
    

    UPDATE2: related question where the memory issue was finally solved