I have a dataframe df with 2 columns of text embeddings, namely embedding_1 and embedding_2. I want to create a third column in df named distances, which should contain the cosine similarity between embedding_1 and embedding_2 for each row.
But when I try to implement this using the following code, I get a ValueError.
How can I fix it?
Dataframe df
embedding_1 | embedding_2
[[-0.28876397, -0.6367827, ...]] | [[-0.49163356, -0.4877703,...]]
[[-0.28876397, -0.6367827, ...]] | [[-0.06686627, -0.75147504...]]
[[-0.28876397, -0.6367827, ...]] | [[-0.42776933, -0.88310856,...]]
[[-0.28876397, -0.6367827, ...]] | [[-0.6520882, -1.049325,...]]
[[-0.28876397, -0.6367827, ...]] | [[-1.4216679, -0.8930428,...]]
Code to Calculate Cosine Similarity
df['distances'] = cosine_similarity(df['embeddings_1'], df['embeddings_2'])
Error
ValueError: setting an array element with a sequence.
Required Dataframe
embedding_1 | embedding_2 | distances
[[-0.28876397, -0.6367827, ...]] | [[-0.49163356, -0.4877703,...]] | 0.427
[[-0.28876397, -0.6367827, ...]] | [[-0.06686627, -0.75147504...]] | 0.673
[[-0.28876397, -0.6367827, ...]] | [[-0.42776933, -0.88310856,...]] | 0.882
[[-0.28876397, -0.6367827, ...]] | [[-0.6520882, -1.049325,...]] | 0.665
[[-0.28876397, -0.6367827, ...]] | [[-1.4216679, -0.8930428,...]] | 0.312
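Assuming this is scikit-learn's cosine_similarity(), it expects each argument to be a 2D array of shape (n_samples, n_features). A column whose cells are themselves arrays cannot be converted into one, which is what raises the ValueError. You can use apply() to call cosine_similarity() on each row instead: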
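from sklearn.metrics.pairwise import cosine_similarity  # assumed source of cosine_similarity

def cal_cosine_similarity(row):
    # Two (1, n) inputs give a (1, 1) result; [0][0] extracts the scalar
    return cosine_similarity(row['embeddings_1'], row['embeddings_2'])[0][0]
df['distances'] = df.apply(cal_cosine_similarity, axis=1)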
Or as a one-liner:
df['distances'] = df.apply(lambda row: cosine_similarity(row['embeddings_1'], row['embeddings_2'])[0][0], axis=1)
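To try the row-wise approach end to end, here is a minimal sketch on toy data. It assumes cosine_similarity comes from sklearn.metrics.pairwise and that every cell holds a 2D array of shape (1, n), as the printed dataframe suggests. The values, the helper name row_cosine, and the distances_vec column are made up for illustration. The last part shows a possible vectorized alternative that stacks the cells into plain matrices and computes the row-wise cosine with NumPy, which may be faster on large frames.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical values): each cell is a 2D array of shape (1, 3)
df = pd.DataFrame({
    "embeddings_1": [np.array([[1.0, 0.0, 0.0]]), np.array([[1.0, 1.0, 0.0]])],
    "embeddings_2": [np.array([[1.0, 0.0, 0.0]]), np.array([[0.0, 1.0, 0.0]])],
})


def row_cosine(row):
    # cosine_similarity returns a (1, 1) array for two (1, n) inputs,
    # so [0][0] pulls out the single scalar value
    return cosine_similarity(row["embeddings_1"], row["embeddings_2"])[0][0]


df["distances"] = df.apply(row_cosine, axis=1)

# Vectorized alternative: stack the cells into (n_rows, n_features) matrices
a = np.vstack(df["embeddings_1"].tolist())
b = np.vstack(df["embeddings_2"].tolist())

# Row-wise dot product divided by the product of the row norms
df["distances_vec"] = (a * b).sum(axis=1) / (
    np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
)

print(df[["distances", "distances_vec"]])
```

Both columns should agree up to floating-point error. Note that the vectorized version would divide by zero for an all-zero embedding, whereas the scikit-learn function handles that case internally.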