I am struggling to compute the similarity between pairs of rows and store the results in a new set of columns, but only where another column meets a specific criterion. For example, suppose I have a df with four people, their friend status, and their social preferences.
import numpy as np
import pandas as pd
preference = {'person': ["Sara","Jordan","Amish","Kimmie"], 'game_night': [30,10,50,30], 'movies': [10,10,20,10], 'dinner_out': [20,20,30,10]}
near = {'person': ["Sara","Jordan","Amish","Kimmie"], 'friendSara': [0,1,0,0], 'friendJordan': [1,0,1,1], 'friendAmish': [0,1,0,1], 'friendKimmie': [0,1,1,0]}
df = pd.DataFrame(data=preference)
near_df = pd.DataFrame(data=near)
Please challenge me if you feel there is a better way to organize the df or to approach the problem. In this example, I'm looking to create a series of new columns named 'simSara', 'simJordan', etc. that are filled with the cosine similarity dot(person1_preferences, person2_preferences)/(norm(person1_preferences)*norm(person2_preferences)) between each person's 3 social preferences and every other person's. For example, the first added column, 'simSara', would have 0.873 in its second row (because Jordan and Sara are friends).
Create a numpy array that summarizes each person's preferences as a vector, with each vector itself being an np.array:
prefVec = df.apply(lambda x: np.array([x.game_night,x.movies,x.dinner_out]),axis=1).to_numpy()
You should get something like this:
array([
array([30, 10, 20]),
array([10, 10, 20]),
array([50, 20, 30]),
array([30, 10, 10])
],
dtype=object)
Define a custom function for your operation:
def getVal(v1, v2):
    # cosine similarity: dot product divided by the product of the two norms
    return np.sum(v1*v2)/(np.sqrt((v1**2).sum())*np.sqrt((v2**2).sum()))
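As a quick sanity check (a sketch, not one of the steps above), the function can be applied to Sara's and Jordan's vectors from the question. It should reproduce the 0.873 the question expects. The variable names sara, jordan and sim are only illustrative.

```python
import numpy as np

def getVal(v1, v2):
    # cosine similarity: dot product divided by the product of the two norms
    return np.sum(v1 * v2) / (np.sqrt((v1 ** 2).sum()) * np.sqrt((v2 ** 2).sum()))

# preference order: game_night, movies, dinner_out
sara = np.array([30, 10, 20])
jordan = np.array([10, 10, 20])

sim = getVal(sara, jordan)
print(round(sim, 3))  # 0.873
```

By hand: the dot product is 300 + 100 + 400 = 800, and the norms are sqrt(1400) and sqrt(600), so 800 / sqrt(840000) is roughly 0.873.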
Now we essentially need a custom inner product built from the function defined above. np.frompyfunc takes our custom function plus two integers specifying how many inputs and outputs it has. By passing prefVec to the resulting customFunc once as a column and once as a row, we broadcast the operation: the column and the row are "stretched" against each other into a 4x4 grid, so every pair of people goes through our custom function:
customFunc = np.frompyfunc(getVal,2,1)
out = customFunc(prefVec.reshape(-1,1),prefVec)
#                ^column prefVec       ^horizontal prefVec
out should look like this:
array([[1. , 0.87287156, 0.99717646, 0.96698756],
[0.87287156, 1. , 0.86094603, 0.73854895],
[0.99717646, 0.86094603, 1. , 0.97823198],
[0.96698756, 0.73854895, 0.97823198, 1. ]])
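If the object-dtype array that np.frompyfunc returns is inconvenient, a fully vectorized alternative is possible. This is a different technique from the one above, shown only as a sketch: it assumes the preferences are stored as a plain 2-D float array (here called X, a name introduced for illustration) instead of an array of arrays.

```python
import numpy as np

# Same four preference vectors, as a plain 2-D float array
# rows: Sara, Jordan, Amish, Kimmie; columns: game_night, movies, dinner_out
X = np.array([[30, 10, 20],
              [10, 10, 20],
              [50, 20, 30],
              [30, 10, 10]], dtype=float)

norms = np.linalg.norm(X, axis=1)          # length of each row vector
sim = (X @ X.T) / np.outer(norms, norms)   # all pairwise cosine similarities

print(sim.round(3))
```

X @ X.T holds every pairwise dot product and np.outer(norms, norms) every pairwise product of norms, so the division yields the same 4x4 matrix, this time with a float dtype.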
Turn it into a dataframe, building the column names from the persons in your original df.person column:
pd.DataFrame(
    out,
    columns=df.person.apply(lambda x: 'sim{}'.format(x)).to_numpy(),
    index=df.person
).reset_index()
Output:
   person   simSara  simJordan  simAmish  simKimmie
0    Sara  1.000000   0.872872  0.997176   0.966988
1  Jordan  0.872872   1.000000  0.860946   0.738549
2   Amish  0.997176   0.860946  1.000000   0.978232
3  Kimmie  0.966988   0.738549  0.978232   1.000000
If you want everything in the same dataframe, merge the above output with your original df on the person column.
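A minimal sketch of that merge, plus one possible way to honor the "only if they are friends" condition from the question. The names sim_df, merged, friend and masked are illustrative. For brevity the similarity matrix is computed here with a vectorized matrix product rather than np.frompyfunc, and the masking step assumes the friend columns of near_df are in the same person order as the sim columns.

```python
import numpy as np
import pandas as pd

preference = {'person': ["Sara", "Jordan", "Amish", "Kimmie"],
              'game_night': [30, 10, 50, 30],
              'movies': [10, 10, 20, 10],
              'dinner_out': [20, 20, 30, 10]}
near = {'person': ["Sara", "Jordan", "Amish", "Kimmie"],
        'friendSara': [0, 1, 0, 0], 'friendJordan': [1, 0, 1, 1],
        'friendAmish': [0, 1, 0, 1], 'friendKimmie': [0, 1, 1, 0]}
df = pd.DataFrame(preference)
near_df = pd.DataFrame(near)

people = df['person']
sim_cols = ['sim' + p for p in people]

# Pairwise cosine similarity (vectorized stand-in for the frompyfunc step)
X = df[['game_night', 'movies', 'dinner_out']].to_numpy(dtype=float)
norms = np.linalg.norm(X, axis=1)
sim = (X @ X.T) / np.outer(norms, norms)

sim_df = pd.DataFrame(sim, columns=sim_cols)
sim_df.insert(0, 'person', people)

# Everything in one dataframe, joined on the person column
merged = df.merge(sim_df, on='person')

# Keep a similarity only where the matching friend flag is 1, NaN elsewhere
friend = near_df[['friend' + p for p in people]].to_numpy().astype(bool)
masked = sim_df.copy()
masked[sim_cols] = np.where(friend, sim, np.nan)

print(merged)
print(masked)
```

With the masking applied, the second row of simSara keeps 0.873 because Jordan and Sara are friends, while the other entries in that column become NaN.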