python  pandas  scikit-learn  tf-idf  tfidfvectorizer

How to Transform sklearn tfidf vector pandas output to a meaningful format


I have used sklearn to obtain tfidf scores for my corpus, but the output is not in the format I want.

Code:

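import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(ngram_range=(1, 3))
tfidf_matrix = vect.fit_transform(df_doc_wholetext['csv_text'])

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(tfidf_matrix.toarray(), columns=vect.get_feature_names_out())

df['filename'] = df.index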

What I have:

[image: wide table, one row per file and one score column per word]

word1, word2, and word3 stand in for arbitrary words in the corpus; I use those names only as an example.

What I need:

[image: long table, one row per file and word, with its score]

I tried transforming it, but that turns all the columns into rows. Is there a way to achieve this?


Solution

  • df1 = df.filter(like='word').stack().reset_index()
    df1.columns = ['filename','word_name','score']
    

    Output:

       filename word_name  score
    0         0     word1   0.01
    1         0     word2   0.04
    2         0     word3   0.05
    3         1     word1   0.02
    4         1     word2   0.99
    5         1     word3   0.07
    

    Update for general column headers (iloc[:, 1:] skips the first column, so this assumes filename comes first):

    df1 = df.iloc[:,1:].stack().reset_index()
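
For anyone who wants to try this end to end, here is a minimal sketch of the same reshape that does not depend on where the filename column sits. It assumes a toy frame in place of the real TF-IDF output (the scores are copied from the output above, and word1..word3 are placeholders). In the question's code, filename is appended as the last column, so a positional slice would drop a word column instead. Selecting filename by name avoids that. The names long_stack and long_melt are mine, not from the original post.

```python
import pandas as pd

# Toy stand-in for the TF-IDF frame from the question; the scores are the
# ones shown in the answer's output, and word1..word3 are placeholder terms.
df = pd.DataFrame(
    {
        "word1": [0.01, 0.02],
        "word2": [0.04, 0.99],
        "word3": [0.05, 0.07],
    }
)
df["filename"] = df.index  # as in the question: appended as the last column

# Option 1: move filename into the index, then stack the score columns.
long_stack = df.set_index("filename").stack().reset_index()
long_stack.columns = ["filename", "word_name", "score"]

# Option 2: melt produces the same rows, grouped by word rather than by file,
# so sort them to match the stacked layout.
long_melt = df.melt(id_vars="filename", var_name="word_name", value_name="score")
long_melt = long_melt.sort_values(["filename", "word_name"]).reset_index(drop=True)

print(long_stack)
```

Real TF-IDF matrices are mostly zeros, so the long table can get very large. If only the terms that actually occur in each file are of interest, filtering with long_stack[long_stack["score"] > 0] afterwards keeps just those rows.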