My first dataframe contains tokenized sentences; the second is a matrix holding the TF-IDF score of every word in every sentence.
I'm trying to create a new column that stores only the TF-IDF scores of the words that appear in each sentence. How can I do it?
Tokenized sentences table
Index | Tokenized_string |
---|---|
1 | [word1,word2,word3] |
2 | [word1,word3,word4] |
TF-IDF table
Index | Word1 | Word2 | ... |
---|---|---|---|
1 | 0.03 | 0.06 | ... |
2 | 0.5 | 0.5 | ... |
The table I'm trying to create
Index | Tokenized_string | TFIDF of each word |
---|---|---|
1 | [word1,word2,word3] | [0.03,0.06,0.1] |
2 | [word1,word3,word4] | [0.5,0.4,0.2] |
To create the dataframes in my example:
import pandas as pd
df = pd.DataFrame({
    'Tokenized_string': [
        ['word1', 'word2', 'word3'],
        ['word1', 'word3', 'word4'],
    ]
})
df_2 = pd.DataFrame({
    'Tokenized_string': [
        ['word1', 'word2', 'word3'],
        ['word1', 'word3', 'word4'],
    ],
    'TFIDF of each word': [
        [0.03, 0.06, 0.1],
        [0.5, 0.4, 0.2],
    ]
})
You can do that as follows, using this tfidf_df as an example.
tfidf_df = pd.DataFrame({
    'Word1': [0.03, 0.5],
    'Word2': [0.06, 0.5],
    'Word3': [0.04, 0.5]
})
Note that you may need to rename tfidf_df to match your own variable names.
tfidf_df['TFIDF of each word'] = tfidf_df[sorted(tfidf_df.columns)].values.tolist()
df_2 = pd.concat([df, tfidf_df["TFIDF of each word"]], axis=1)
print(df_2)
        Tokenized_string  TFIDF of each word
0  [word1, word2, word3]  [0.03, 0.06, 0.04]
1  [word1, word3, word4]     [0.5, 0.5, 0.5]
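Note that the snippet above stores the value of every word column for each row, regardless of which words the sentence contains. If you want only the scores of the tokens in each row, in token order, a per-row lookup is one option. The following is a minimal sketch, not a definitive implementation. It assumes that both dataframes share the same row index and that the column names of tfidf_df match the tokens once lowercased; the Word4 column and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Tokenized_string': [
        ['word1', 'word2', 'word3'],
        ['word1', 'word3', 'word4'],
    ]
})

# Example TF-IDF matrix: one column per word, one row per sentence.
# 'Word4' and its values are hypothetical, added so every token has a column.
tfidf_df = pd.DataFrame({
    'Word1': [0.03, 0.5],
    'Word2': [0.06, 0.5],
    'Word3': [0.04, 0.5],
    'Word4': [0.0, 0.2],
})

# Assumption: column names equal the tokens after lowercasing.
tfidf_lower = tfidf_df.rename(columns=str.lower)

# For each sentence, look up the score of each of its tokens in the
# matching row of the TF-IDF matrix, keeping the token order.
df['TFIDF of each word'] = [
    [float(tfidf_lower.at[idx, token]) for token in tokens]
    for idx, tokens in df['Tokenized_string'].items()
]

print(df)
```

With this sample data the new column holds [0.03, 0.06, 0.04] for the first row and [0.5, 0.5, 0.2] for the second. A token that has no matching column raises a KeyError, so you may want to filter or default those tokens depending on how your vocabulary was built.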