pandas: calculating average similarity across all categories

I have a dataframe like the following but larger:

import pandas as pd

data = {'First':  ['First value','Third value','Second value','First value','Third value','Second value'],
        'Second': ['the old man is here','the young girl is there', 'the old woman is here','the  young boy is there','the young girl is here','the old girl is here']}

df = pd.DataFrame (data, columns = ['First','Second'])

and I have calculated the similarity between each possible pair of groups, based on the first column, like this (I got help for this part from other answers on Stack Overflow):

import nltk
from itertools import combinations

# function to calculate the similarity between a pair of documents
def similarity_measure(doc1, doc2):

    words_doc1 = set(doc1)
    words_doc2 = set(doc2)

    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return float(len(intersection)) / len(union) * 100

# getting the lemmatized text alongside the intents
data_similarity = df.groupby('First')['Second'].apply(lambda x: nltk.tokenize.word_tokenize(' '.join(x)))
data_similarity = data_similarity.reset_index()

# returning the similarity measures for each pair in the dataset
for val in list(combinations(range(len(data_similarity)), 2)):
    print(f"similarity between {data_similarity.iloc[val[0],0]} and {data_similarity.iloc[val[1],0]} intents is: {similarity_measure(data_similarity.iloc[val[0],1],data_similarity.iloc[val[1],1])}")

What I would like as output is an average across all pairs. For example, if the above code produced the following output:

similarity between first value and second value is 60
similarity between first value and third value is 50 
similarity between second value and third value is 55
similarity between second value and first value is 60
similarity between third value and first value is 50
similarity between third value and second value is 55

I would like to have the average score of first value with any combination, second value with any combination, and third value with any combination like this:

first value average across all possible values is 55
second value average across all possible values is 57.5
third value average across all possible values is 52.5
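
As a minimal sketch of the averaging described above, the example scores can be placed in a DataFrame and averaged per first element of each pair. The scores are the illustrative numbers from the example output, not real results, and the column names `A`, `B` and `Similarity` are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative pair scores copied from the example output above (not real results).
pairs = [
    ("first value", "second value", 60),
    ("first value", "third value", 50),
    ("second value", "third value", 55),
    ("second value", "first value", 60),
    ("third value", "first value", 50),
    ("third value", "second value", 55),
]

# Hypothetical column names: A is the group being averaged, B is its partner.
pair_scores = pd.DataFrame(pairs, columns=["A", "B", "Similarity"])

# Average each group's score over every pair in which it appears first.
averages = pair_scores.groupby("A")["Similarity"].mean()

for name, score in averages.items():
    print(f"{name} average across all possible values is {score}")
```

Each group has to appear as the first element of every pair it takes part in, which is why both orderings of each pair are listed.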


  • EDIT: Based on your comments, here is what you can do.

    1. First, build the data_similarity table, which combines the tokens from all sentences in each group.
    2. Calculate the pairwise similarity between groups as (group A, group B, similarity) tuples, in both orders.
    3. Put the tuples into a dataframe, then group by the first group and take the mean.
    import nltk
    from itertools import product

    # function to calculate the similarity between a pair of documents
    def similarity_measure(doc1, doc2):
        words_doc1 = set(doc1)
        words_doc2 = set(doc2)
        intersection = words_doc1.intersection(words_doc2)
        union = words_doc1.union(words_doc2)
        return float(len(intersection)) / len(union) * 100

    # getting the lemmatized text alongside the intents
    # (word_tokenize needs NLTK's Punkt tokenizer data to be installed)
    data_similarity = df.groupby('First')['Second'].apply(lambda x: nltk.tokenize.word_tokenize(' '.join(x)))
    data_similarity = data_similarity.reset_index()

    # every ordered pair of distinct groups, with its similarity score
    all_pairs = [(i, l, similarity_measure(j, m)) for (i, j), (l, m) in
                 product(zip(data_similarity['First'], data_similarity['Second']), repeat=2) if i != l]
    pair_similarity = pd.DataFrame(all_pairs, columns=['A', 'B', 'Similarity'])

    # average similarity of each group against all other groups
    group_similarity = pair_similarity.groupby(['A'])['Similarity'].mean().reset_index()
    print(group_similarity)

    Output:

                  A  Similarity
    0   First value   47.777778
    1  Second value   45.000000
    2   Third value   52.777778
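
For readers who want to try the approach without installing NLTK, here is a self-contained sketch of the same three steps. It is not the code from the answer: plain whitespace splitting stands in for `nltk.tokenize.word_tokenize` (an assumption that only holds for simple, punctuation-free sentences like the sample data), `itertools.permutations` replaces the filtered `product`, and the helper name `jaccard_percent` is hypothetical.

```python
from itertools import permutations

import pandas as pd

data = {
    "First": ["First value", "Third value", "Second value",
              "First value", "Third value", "Second value"],
    "Second": ["the old man is here", "the young girl is there",
               "the old woman is here", "the  young boy is there",
               "the young girl is here", "the old girl is here"],
}
df = pd.DataFrame(data, columns=["First", "Second"])


def jaccard_percent(tokens_a, tokens_b):
    """Jaccard similarity of two token collections, as a percentage."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Two empty documents: avoid dividing by zero.
        return 0.0
    return len(set_a & set_b) / len(union) * 100


# Step 1: one token list per group.
# Assumption: str.split() is a stand-in for nltk's word_tokenize.
group_tokens = df.groupby("First")["Second"].apply(lambda s: " ".join(s).split())

# Step 2: similarity for every ordered pair of distinct groups.
rows = [
    (a, b, jaccard_percent(group_tokens[a], group_tokens[b]))
    for a, b in permutations(group_tokens.index, 2)
]
pair_similarity = pd.DataFrame(rows, columns=["A", "B", "Similarity"])

# Step 3: average each group's similarity against all other groups.
group_similarity = pair_similarity.groupby("A")["Similarity"].mean()
print(group_similarity)
```

Because the Jaccard measure is symmetric, each unordered pair contributes the same score to both of its groups, so generating ordered pairs simply makes the later `groupby` straightforward.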