I used sklearn to obtain TF-IDF scores for my corpus, but the output is not in the format I want.
Code:
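import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(ngram_range=(1, 3))
tfidf_matrix = vect.fit_transform(df_doc_wholetext['csv_text'])
# scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2
df = pd.DataFrame(tfidf_matrix.toarray(), columns=vect.get_feature_names_out())
df['filename'] = df.index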
What I have:
word1, word2 and word3 stand for any words in the corpus; I use these names only as an example.
What I need:
I tried transforming it, but that turns all the columns into rows. Is there a way to achieve this?
df1 = df.filter(like='word').stack().reset_index()
df1.columns = ['filename','word_name','score']
Output:
   filename word_name  score
0         0     word1   0.01
1         0     word2   0.04
2         0     word3   0.05
3         1     word1   0.02
4         1     word2   0.99
5         1     word3   0.07
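Update for general column headers, assuming filename is the first column (if it is the last column, as in the question's code, use df.iloc[:, :-1] instead):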
df1 = df.iloc[:,1:].stack().reset_index()
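For reference, here is a minimal self-contained sketch of the same wide-to-long reshaping. The toy frame is an assumption: it reuses the scores from the output above rather than a real TF-IDF matrix, and it places filename first. The set_index and melt variants are alternatives that do not depend on column position; they are not taken from the original answer.

```python
import pandas as pd

# Toy wide-format frame (assumed layout): one row per file, one column per term.
# The scores are the ones shown in the output above.
df = pd.DataFrame({
    "filename": [0, 1],
    "word1": [0.01, 0.02],
    "word2": [0.04, 0.99],
    "word3": [0.05, 0.07],
})

# Variant 1: move 'filename' into the index, then stack the term columns into rows.
long_stack = df.set_index("filename").stack().reset_index()
long_stack.columns = ["filename", "word_name", "score"]

# Variant 2: melt keeps 'filename' as an identifier column;
# sorting groups the rows per file, matching the stacked order.
long_melt = (
    df.melt(id_vars="filename", var_name="word_name", value_name="score")
      .sort_values(["filename", "word_name"])
      .reset_index(drop=True)
)

# Optional: a real TF-IDF matrix is mostly zeros, so filtering them out
# keeps the long table small.
nonzero = long_stack[long_stack["score"] > 0]

print(long_stack)
```

Both variants select columns by name, so they work whether filename is the first or the last column.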