I have a dataframe like the following but larger:
import pandas as pd

data = {'First': ['First value', 'Third value', 'Second value',
                  'First value', 'Third value', 'Second value'],
        'Second': ['the old man is here', 'the young girl is there',
                   'the old woman is here', 'the young boy is there',
                   'the young girl is here', 'the old girl is here']}
df = pd.DataFrame(data, columns=['First', 'Second'])
and I have calculated the similarity between each possible pair of groups, based on the first column, like this (I got help for this part from other answers on Stack Overflow):
import nltk
from itertools import combinations

# function to calculate the similarity between each pair of documents
def similarity_measure(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return len(intersection) / len(union) * 100
# getting the tokenized text alongside the intents
# (word_tokenize needs the NLTK 'punkt' models: nltk.download('punkt'))
data_similarity = df.groupby('First')['Second'].apply(lambda x: nltk.tokenize.word_tokenize(' '.join(x)))
data_similarity = data_similarity.reset_index()
# printing the similarity measure for each pair of groups in the dataset
for a, b in combinations(range(len(data_similarity)), 2):
    print(f"similarity between {data_similarity.iloc[a, 0]} and {data_similarity.iloc[b, 0]} "
          f"intents is: {similarity_measure(data_similarity.iloc[a, 1], data_similarity.iloc[b, 1])}")
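For reference, similarity_measure is just the Jaccard index of the two token sets, scaled to a percentage. A quick illustrative check with made-up tokens:

print(similarity_measure(['the', 'old', 'man'], ['the', 'old', 'woman']))
# intersection = {'the', 'old'} (2 words), union = 4 words -> prints 50.0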
What I would like to have as output is an average across all pairs. For example, if the above code produced the following output:
similarity between first value and second value is 60
similarity between first value and third value is 50
similarity between second value and third value is 55
similarity between second value and first value is 60
similarity between third value and first value is 50
similarity between third value and second value is 55
I would like the average score of first value over all of its pairs, second value over all of its pairs, and third value over all of its pairs (e.g. for first value: (60 + 50) / 2 = 55), like this:
first value average across all possible values is 55
second value average across all possible values is 57.5
third value average across all possible values is 52.5
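In other words, something along these lines, aggregating the scores from the loop above (a minimal sketch; it collects each pair's score under both group names and then averages):

from collections import defaultdict

scores = defaultdict(list)
for a, b in combinations(range(len(data_similarity)), 2):
    score = similarity_measure(data_similarity.iloc[a, 1], data_similarity.iloc[b, 1])
    # each pair contributes to the averages of both of its groups
    scores[data_similarity.iloc[a, 0]].append(score)
    scores[data_similarity.iloc[b, 0]].append(score)
for name, vals in scores.items():
    print(f"{name} average across all possible values is {sum(vals) / len(vals)}")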
EDIT: Based on your comments, here is what you can do.
You already have the data_similarity table, which combines the tokens from the different sentences of each group. The full code:

import nltk
from itertools import combinations, product
# function to calculate the similarity between each pair of documents
def similarity_measure(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return len(intersection) / len(union) * 100

# getting the tokenized text alongside the intents
data_similarity = df.groupby('First')['Second'].apply(lambda x: nltk.tokenize.word_tokenize(' '.join(x)))
data_similarity = data_similarity.reset_index()
# product(..., repeat=2) yields every ordered pair, so each unordered pair is
# counted once under each of its two groups; similarity_measure is symmetric,
# so the per-group mean is unaffected by the double counting
all_pairs = [(i, l, similarity_measure(j, m))
             for (i, j), (l, m) in product(zip(data_similarity['First'],
                                               data_similarity['Second']),
                                           repeat=2)
             if i != l]
pair_similarity = pd.DataFrame(all_pairs, columns=['A', 'B', 'Similarity'])
group_similarity = pair_similarity.groupby('A')['Similarity'].mean().reset_index()
print(group_similarity)
A Similarity
0 First value 47.777778
1 Second value 45.000000
2 Third value 52.777778
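If you also want the full pairwise table rather than only the per-group means, the same pair_similarity frame can be pivoted into a matrix (a small sketch; the diagonal stays NaN because self-pairs were filtered out):

similarity_matrix = pair_similarity.pivot(index='A', columns='B', values='Similarity')
print(similarity_matrix)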