I have this dataframe:
mylist = [
"₹67.00 to Rupam Sweets using Bank Account XXXXXXXX5343<br>11 Feb 2023, 20:42:25",
"₹66.00 to Rupam Sweets using Bank Account XXXXXXXX5343<br>10 Feb 2023, 21:09:23",
"₹32.00 to Nagori Sajjad Mohammed Sayyed using Bank Account XXXXXXXX5343<br>9 Feb 2023, 07:06:52",
"₹110.00 to Vikram Manohar Jsohi using Bank Account XXXXXXXX5343<br>9 Feb 2023, 06:40:08",
"₹120.00 to Winner Dinesh Gupta using Bank Account XXXXXXXX5343<br>30 Jan 2023, 06:23:55",
]
import pandas as pd
df = pd.DataFrame(mylist)
df.columns = ["full_text"]
ndf = df.full_text.str.split(" to ", n=1, expand=True)
ndf.columns = ["amt", "full_text"]
ndf2 = ndf.full_text.str.split(" using Bank Account XXXXXXXX5343<br>", expand=True)
ndf2.columns = ["client", "date"]
df = ndf.join(ndf2)[["date", "client", "amt"]]
I have created embeddings for each client name:
from openai.embeddings_utils import get_embedding, cosine_similarity
import openai
openai.api_key = 'xxx'
embedding_model = "text-embedding-ada-002"
embeddings = df.client.apply(lambda x: get_embedding(x, engine=embedding_model))
df["embeddings"] = embeddings
I can now calculate the similarity score for a given string, for example "Rupam Sweet", using:
query_embedding = get_embedding("Rupam Sweet", engine="text-embedding-ada-002")
df["similarity"] = df.embeddings.apply(lambda x: cosine_similarity(x, query_embedding))
But I need the similarity score of each client against every other client. In other words, the client names should label both the rows and the columns, and the scores should be the values. How do I achieve this?
If you have a vectorized similarity function f(x, y) and want to apply it to all pairs of a series, you can make use of numpy broadcasting. If f is not a vectorized function, you can turn it into one by calling f_vec = np.vectorize(f) on it. In the example below, I'm using the ratio function from the fuzzywuzzy module for illustration purposes, but it works the same way with any other comparison function.
from fuzzywuzzy.fuzz import ratio
import numpy as np
ratio_vec = np.vectorize(ratio)
s = pd.Series(mylist)
# Index the underlying array: s[:, None] on a Series fails in pandas 2.x
df = pd.DataFrame(ratio_vec(s.to_numpy(), s.to_numpy()[:, None]))
The result is a similarity matrix:
     0    1    2    3    4
0  100   92   74   76   71
1   92  100   74   73   72
2   70   74  100   74   67
3   73   73   73  100   72
4   71   72   64   74  100
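For the embeddings in the question, the same all-pairs idea can be written as a single matrix product instead of np.vectorize. The following is only a sketch under assumptions: the three short vectors are made-up stand-ins for the real ada-002 embeddings, and the names demo, emb, unit, sim and sim_df are hypothetical. It assumes the embeddings column holds one equal-length list of floats per row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real embeddings: short made-up vectors,
# one per client, instead of actual text-embedding-ada-002 output.
clients = ["Rupam Sweets", "Nagori Sajjad Mohammed Sayyed", "Vikram Manohar Jsohi"]
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
demo = pd.DataFrame({"client": clients, "embeddings": vectors})

# Turn the column of lists into one (n_clients, dim) array.
emb = np.array(demo["embeddings"].tolist())

# Scale every row to unit length; the dot product of two unit vectors
# is their cosine similarity.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = unit @ unit.T

# Client names label both the rows and the columns.
sim_df = pd.DataFrame(sim, index=demo["client"], columns=demo["client"])
print(sim_df.round(3))
```

With the question's data, "Rupam Sweets" occurs twice, so the row and column labels would repeat; dropping duplicate clients first (for example with df.drop_duplicates(subset="client")) avoids that if one row per client is wanted.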