Tags: python, pandas, dataframe, tf-idf

Retrieve the matching TFIDF of each word by sentence from a TFIDF matrix (pandas)


My first dataframe contains sentences that I tokenized. The second is a matrix holding the TFIDF of every word in every sentence.

I'm trying to create a new column that stores only the TFIDF of the words that appear in each sentence. How can I do it?

Tokenized sentences table

Index  Tokenized_string
1      [word1,word2,word3]
2      [word1,word3,word4]

TFIDF table

Index  Word1  Word2  ...
1      0.03   0.06   ...
2      0.5    0.5    ...

The table I'm trying to create

Index  Tokenized_string     TFIDF of each word
1      [word1,word2,word3]  [0.03,0.06,0.1]
2      [word1,word3,word4]  [0.5,0.4,0.2]

To create the dataframes in my example:

import pandas as pd

# Input: one list of tokens per sentence.
df = pd.DataFrame({'Tokenized_string':
                   [['word1', 'word2', 'word3'],
                    ['word1', 'word3', 'word4']]
                   })

# Desired output: the tokens plus the TFIDF of each token.
df_2 = pd.DataFrame({'Tokenized_string':
                     [['word1', 'word2', 'word3'],
                      ['word1', 'word3', 'word4']],
                     'TFIDF of each word':
                     [[0.03, 0.06, 0.1],
                      [0.5, 0.4, 0.2]]})

Solution

  • You can build the new column from the rows of the TFIDF table.

    Take the following tfidf_df as an example:

    tfidf_df = pd.DataFrame({
        'Word1': [0.03, 0.5],
        'Word2': [0.06, 0.5],
        'Word3': [0.04, 0.5]
    })
    

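    Note that you may need to rename tfidf_df to match your own variable names; df is the tokenized dataframe from the question. The first line collects every column of each row, in sorted column order, into one list per row. It does not filter by the tokens in Tokenized_string.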

    tfidf_df['TFIDF of each word'] = tfidf_df[sorted(tfidf_df.columns)].values.tolist()
    df_2 = pd.concat([df, tfidf_df["TFIDF of each word"]], axis=1)
    
    print(df_2)
            Tokenized_string  TFIDF of each word
    0  [word1, word2, word3]  [0.03, 0.06, 0.04]
    1  [word1, word3, word4]     [0.5, 0.5, 0.5]
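
The snippet above returns the same columns for every row, so the second sentence gets the scores for Word1 to Word3 even though its tokens are word1, word3 and word4. If only the scores of the words that actually occur in each sentence are wanted, one possible approach is to look up each row's tokens as column labels. The following is a minimal sketch, not a definitive implementation. It assumes that both dataframes share the same index and that the column names of tfidf_df match the tokens exactly (for the capitalised columns shown in the question, something like tfidf_df.columns.str.lower() could be applied first). The values that the question does not show are invented for illustration.

```python
import pandas as pd

# Tokenized sentences from the question.
df = pd.DataFrame({
    'Tokenized_string': [['word1', 'word2', 'word3'],
                         ['word1', 'word3', 'word4']]
})

# Hypothetical TFIDF matrix: one row per sentence, one column per word.
# Assumption: column names equal the tokens; 0.0 entries are made up
# for words that do not occur in a sentence.
tfidf_df = pd.DataFrame({
    'word1': [0.03, 0.5],
    'word2': [0.06, 0.0],
    'word3': [0.1, 0.4],
    'word4': [0.0, 0.2],
})

df_2 = df.copy()

# For each sentence, select that row of tfidf_df and only the columns
# named by the sentence's tokens, keeping the token order.
df_2['TFIDF of each word'] = [
    tfidf_df.loc[idx, tokens].tolist()
    for idx, tokens in df['Tokenized_string'].items()
]

print(df_2)
```

With this sample data the new column holds [0.03, 0.06, 0.1] for the first sentence and [0.5, 0.4, 0.2] for the second, which is the table asked for in the question. A token that has no column in tfidf_df raises a KeyError here; replacing the lookup with tfidf_df.loc[idx].reindex(tokens).tolist() would yield NaN for such tokens instead.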