
Construct dataframe from pairwise Word Mover Distance score list


I'd like to run PCA on a list of pairwise sentence distances (Word Mover's Distance) that I have. So far I've computed a similarity score for each pair of sentences and stored all the pairwise scores in a list. My main question is:

How do I construct a matrix that contains these similarity scores, indexed by the original sentences? Currently the list only contains each pair's score; I haven't found a way to map the scores back to the sentences themselves yet.

My ideal dataframe looks like this:

                Sentence1  Sentence2  Sentence3
     Sentence1  1          0.5        0.8
     Sentence2  0.5        1          0.4
     Sentence3  0.8        0.4        1

However, the similarity score list I have looks like this, with no index:

[0.5, 0.8, 0.4]

How do I transform it into a dataframe that I can run PCA on? Thanks!

---- Steps I took to construct the pairwise similarity scores:

from itertools import combinations

# Tokenize all sentences in a column
tokenized_sentences = [s.split() for s in df[col]]

# calculate distance between 2 responses using wmd
def find_similar_docs(sentence_1, sentence_2):
    distance = model.wv.wmdistance(sentence_1, sentence_2)
    return distance

# find response pairs
pairs_sentences = list(combinations(tokenized_sentences, 2))

# get all similarity scores between sentences
list_of_sim = []
for sent_pair in pairs_sentences:
    sim_curr_pair = find_similar_docs(sent_pair[0], sent_pair[1])
    list_of_sim.append(sim_curr_pair)

It would be a lot easier if I had "1" instead of the tokenized sentence (["I", "open", "communication", "culture"]) as the index. :) So I'm a bit stuck here...


Solution

  • Make a distance matrix with numpy, then convert to a pandas dataframe.

    import numpy as np
    import pandas as pd
    
    # calculate distance between 2 responses using wmd
    def find_similar_docs(sentence_1, sentence_2):
        distance = model.wv.wmdistance(sentence_1, sentence_2)
        return distance
      
    # create distance matrix; WMD is symmetric, so this computes each
    # pair twice, which is fine for a small number of sentences
    tokenized_sentences = [s.split() for s in df[col]]
    n = len(tokenized_sentences)
    distances = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            distances[i, j] = find_similar_docs(tokenized_sentences[i], tokenized_sentences[j])

    # make pandas dataframe (named dist_df so it doesn't overwrite
    # the source dataframe df that tokenized_sentences came from)
    labels = ['sentence' + str(i + 1) for i in range(n)]
    dist_df = pd.DataFrame(data=distances, index=labels, columns=labels)
    print(dist_df)
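
  • Alternatively, if you'd rather reuse the list_of_sim you already computed instead of recomputing every pair, scipy can expand the condensed list into the full square matrix: itertools.combinations yields pairs in exactly the order scipy's squareform expects. A minimal sketch of that shortcut (note the diagonal comes out as 0, since the WMD of a sentence to itself is zero, not the 1 shown in the ideal table above):

    import pandas as pd
    from scipy.spatial.distance import squareform

    # condensed pairwise distances, in combinations(..., 2) order
    list_of_sim = [0.5, 0.8, 0.4]

    # expand into a symmetric square matrix (diagonal = 0)
    distances = squareform(list_of_sim)

    labels = ['sentence' + str(i + 1) for i in range(distances.shape[0])]
    dist_df = pd.DataFrame(data=distances, index=labels, columns=labels)
    print(dist_df)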
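
  • To then run PCA on the matrix, one option (assuming scikit-learn, which the question doesn't specify) is to treat each row, i.e. one sentence's distances to all the others, as that sentence's feature vector:

    from sklearn.decomposition import PCA

    # project each sentence onto 2 components based on its
    # pairwise-distance profile (its row in the matrix)
    pca = PCA(n_components=2)
    coords = pca.fit_transform(dist_df.values)
    print(coords)
    print(pca.explained_variance_ratio_)

    Note that PCA on a raw distance matrix is a pragmatic shortcut; if the goal is an embedding that preserves the pairwise distances themselves, sklearn.manifold.MDS with dissimilarity='precomputed' is the more standard tool.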