Tags: python, pandas, nlp

How to get average pairwise cosine similarity per group in Pandas


I have a sample dataframe, built as below:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['facebook', "women tennis"], ['facebook', "men basketball"], ['facebook', 'club'],
                            ['apple', "vice president"], ['apple', 'swimming contest']]),
                  columns=['firm', 'text'])

       firm              text
0  facebook      women tennis
1  facebook    men basketball
2  facebook              club
3     apple    vice president
4     apple  swimming contest

Now I'd like to calculate the degree of text similarity within each firm using word embeddings. For example, the average cosine similarity for facebook would be the mean of the pairwise cosine similarities between rows 0, 1, and 2. The final dataframe should have a column ['mean_cos_between_items'] next to each row for each firm. The value will be the same for every row of a company, since it is a within-firm pairwise comparison.
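
To make the pairing explicit, these are the within-firm index pairs I mean (a quick sketch with itertools.combinations):

from itertools import combinations

# e.g. for facebook (rows 0, 1, 2): all unordered pairs
print(list(combinations([0, 1, 2], 2)))  # [(0, 1), (0, 2), (1, 2)]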

I wrote the code below:

from itertools import combinations

import gensim
from gensim import utils
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from sklearn.metrics.pairwise import cosine_similarity
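
# model_glove below is a set of pretrained GloVe vectors. One way to load one
# (the model name here is just an example) is via the gensim downloader:
import gensim.downloader

model_glove = gensim.downloader.load("glove-wiki-gigaword-50")  # returns a KeyedVectors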

# map each word to vector space
def represent(sentence):
    vectors = []
    for word in sentence:
        try:
            vector = model.wv[word]
            vectors.append(vector)
        except KeyError:
            pass
    return np.array(vectors).mean(axis=0)
    
# get average if more than 1 word is included in the "text" column
def document_vector(items):
    # remove out-of-vocabulary words
    doc = [word for word in items if word in model_glove.vocab]
    if doc:
        doc_vector = model_glove[doc]
        mean_vec = np.mean(doc_vector, axis=0)
    else:
        mean_vec = None
    return mean_vec
    
# get average pairwise cosine similarity score
def mean_cos_sim(grp):
    output = []
    for i, j in combinations(grp.index.tolist(), 2):
        doc_vec = document_vector(grp.iloc[i]['text'])
        if doc_vec is not None and len(doc_vec) > 0:
            sim = cosine_similarity(document_vector(grp.iloc[i]['text']).reshape(1, -1),
                                    document_vector(grp.iloc[j]['text']).reshape(1, -1))
            output.append([i, j, sim])
    return np.mean(np.array(output), axis=0)

# save the result to a new column    
df['mean_cos_between_items'] = df.groupby(['firm']).apply(mean_cos_sim)

However, running this raises an error (the traceback screenshot is not reproduced here).

Could you kindly help? Thanks!


Solution

  • Remove the .vocab here in model_glove.vocab; this is not supported in the current version of gensim any more (the vocab attribute was removed in gensim 4.0, and membership can be tested with word in model_glove directly). Edit: this also needs split() so that it iterates over words and not characters:

    # get average if more than 1 word is included in the "text" column
    def document_vector(items):
        # remove out-of-vocabulary words
        doc = [word for word in items.split() if word in model_glove]
        if doc:
            doc_vector = model_glove[doc]
            mean_vec = np.mean(doc_vector, axis=0)
        else:
            mean_vec = None
        return mean_vec
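
    As a quick sanity check (assuming the 50-dimensional GloVe model loaded above), document_vector now returns one mean vector per text, or None if no word is in the vocabulary:

    print(document_vector("women tennis").shape)  # (50,) with 50-dimensional vectors
    print(document_vector("qqqxxzz"))             # None: nothing is in the vocabulary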
    

  • Here you iterate over tuples of indices when you want to iterate over the values, so drop the .index. Also, you put all values into output, including the indices i and j, so averaging output would average over those too; you would have to specify what exactly you want the average of. Since you don't seem to need i and j, you can append only the resulting sims to a list and then take that list's average:

    # get the average pairwise cosine similarity score
    def mean_cos_sim(grp):
        output = []
        for i, j in combinations(grp.tolist(), 2):
            vec_i, vec_j = document_vector(i), document_vector(j)
            # skip pairs where either text has no in-vocabulary words
            if vec_i is not None and vec_j is not None:
                sim = cosine_similarity(vec_i.reshape(1, -1), vec_j.reshape(1, -1))
                output.append(sim)
        return np.mean(output, axis=0)
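
    Note that cosine_similarity returns a (1, 1) array, which is why the per-firm means below print as [[…]]. If you prefer plain floats, a small variant (my tweak, not part of the original answer) unwraps the value before appending:

    def mean_cos_sim_scalar(grp):
        output = []
        for i, j in combinations(grp.tolist(), 2):
            vec_i, vec_j = document_vector(i), document_vector(j)
            if vec_i is not None and vec_j is not None:
                # [0][0] unwraps the (1, 1) similarity array into a float
                output.append(cosine_similarity(vec_i.reshape(1, -1), vec_j.reshape(1, -1))[0][0])
        return np.mean(output) if output else np.nan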
    

  • Here you try to add the results as a column, but the number of rows differs: the grouped result has one row per firm while the original DataFrame has one row per text. So you have to create a new DataFrame (which you can optionally merge/join with the original DataFrame based on the firm column, as shown after the output below):

    df = pd.DataFrame(np.array(
        [['facebook', "women tennis"], ['facebook', "men basketball"], ['facebook', 'club'],
         ['apple', "vice president"], ['apple', 'swimming contest']]), columns=['firm', 'text'])
    df_grpd = df.groupby(['firm'])["text"].apply(mean_cos_sim)
    

    Overall this will give you (Edit: updated):

    print(df_grpd)
    > firm
      apple       [[0.53190523]]
      facebook    [[0.83989316]]
      Name: text, dtype: object
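
    To attach these values to every row as the question's mean_cos_between_items column, one option (a sketch; the unwrapping assumes the 1x1 arrays shown above) is to merge back on firm:

    means = df_grpd.apply(lambda a: float(np.asarray(a).squeeze()))  # [[x]] -> x
    df_out = df.merge(means.rename('mean_cos_between_items').reset_index(), on='firm')
    print(df_out)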
    

    Edit:

    I just noticed that the reason for the super high scores was a missing tokenization step; see the changed part above. Without the split(), this just compares character similarities, which tend to be super high.
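
    A quick illustration of why the split() matters: iterating over a string yields characters, not words, so document_vector would average single-letter vectors:

    print(list("women tennis"))    # ['w', 'o', 'm', 'e', 'n', ' ', 't', ...] -- characters
    print("women tennis".split())  # ['women', 'tennis'] -- words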