python, calculated-columns, cosine-similarity

Similarity of two rows placed into new column, based on column condition


I am struggling to compute the similarity between pairs of rows and store it in a new set of columns, if and only if another column meets a specific criterion. For example, suppose I have a df with four people, their friend status, and their social preferences.

import pandas as pd

preference = {'person': ["Sara","Jordan","Amish","Kimmie"], 'game_night': [30,10,50,30], 'movies': [10,10,20,10], 'dinner_out': [20,20,30,10]}
near = {'person': ["Sara","Jordan","Amish","Kimmie"], 'friendSara':[0,1,0,0], 'friendJordan': [1,0,1,1], 'friendAmish': [0,1,0,1], 'friendKimmie': [0,1,1,0]}

df = pd.DataFrame(data=preference)
near_df = pd.DataFrame(data=near)

Please challenge me if you feel there is a better way to organize the df or to approach the problem. In this example, I'm looking to create a series of new columns named 'simSara', 'simJordan', etc. that are filled with the cosine similarity, dot(person1_preferences, person2_preferences)/(norm(person1_preferences)*norm(person2_preferences)), between each person's 3 social preferences and each other person's. For example, the first added column, 'simSara', would have its second row populated by 0.873 (because Jordan and Sara are friends).


Solution

  • Create a numpy array that holds each person's preferences as a vector, with each vector itself being an np.array:

    import numpy as np

    prefVec = df.apply(lambda x: np.array([x.game_night,x.movies,x.dinner_out]),axis=1).to_numpy()
    

    prefVec should look something like this:

    array([
        array([30, 10, 20]), 
        array([10, 10, 20]), 
        array([50, 20, 30]),
        array([30, 10, 10])
    ], 
    dtype=object)
    

    Define a custom function for your operation (here, the cosine similarity):

    def getVal(v1,v2):
        return np.sum(v1*v2)/(np.sqrt((v1**2).sum())*np.sqrt((v2**2).sum()))
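
As a quick sanity check, the function can be applied to the Sara and Jordan vectors from the question and compared with the dot/norm form written there. This is only a sketch; the names sara, jordan, manual, and via_norm are illustrative and not part of the original answer.

```python
import numpy as np

def getVal(v1, v2):
    # Cosine similarity: dot product divided by the product of the two norms.
    return np.sum(v1 * v2) / (np.sqrt((v1 ** 2).sum()) * np.sqrt((v2 ** 2).sum()))

# Preference vectors in the order game_night, movies, dinner_out.
sara = np.array([30, 10, 20])
jordan = np.array([10, 10, 20])

manual = getVal(sara, jordan)

# The same quantity via np.dot and np.linalg.norm, as phrased in the question.
via_norm = np.dot(sara, jordan) / (np.linalg.norm(sara) * np.linalg.norm(jordan))

print(round(float(manual), 3))  # 0.873
```

By hand: the dot product is 800 and the norms are sqrt(1400) and sqrt(600), so the result is 800 / sqrt(840000) ≈ 0.873, matching the value expected in the question.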
    

    Now we essentially need a custom inner product built on that function. np.frompyfunc takes the function plus two integers giving its number of inputs and outputs, and returns a ufunc. Passing prefVec once as a column and once as a row lets numpy broadcast the operation: both are "stretched" to a 4x4 grid, and getVal is applied to every (row, column) pair of vectors:

    customFunc = np.frompyfunc(getVal,2,1)
    out = customFunc(prefVec.reshape(-1,1),prefVec).astype(float)
    #                  ^column prefVec       ^horizontal prefVec
    # frompyfunc ufuncs return object-dtype arrays, hence the cast to float
    

    out should look like this:

    array([[1.        , 0.87287156, 0.99717646, 0.96698756],
           [0.87287156, 1.        , 0.86094603, 0.73854895],
           [0.99717646, 0.86094603, 1.        , 0.97823198],
           [0.96698756, 0.73854895, 0.97823198, 1.        ]])
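
The same matrix can also be obtained without calling a Python function per pair. The sketch below swaps in a different technique from the one above: it scales each row to unit length and takes a matrix product, relying on the fact that the dot product of unit vectors is their cosine similarity. The names X, unit, and out_vec are illustrative.

```python
import numpy as np
import pandas as pd

preference = {
    'person': ["Sara", "Jordan", "Amish", "Kimmie"],
    'game_night': [30, 10, 50, 30],
    'movies': [10, 10, 20, 10],
    'dinner_out': [20, 20, 30, 10],
}
df = pd.DataFrame(data=preference)

# One row per person, one column per preference score.
X = df[['game_night', 'movies', 'dinner_out']].to_numpy(dtype=float)

# Divide each row by its Euclidean norm so every row has length 1.
unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Entry (i, j) is the cosine similarity between person i and person j.
out_vec = unit @ unit.T

print(out_vec.round(3))
```

This yields a plain float array directly, which may be preferable when the number of rows is large.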
    

    Turn it into a dataframe, taking the column and index labels from your original df.person column:

    pd.DataFrame(
        out,
        columns=df.person.apply(lambda x: 'sim{}'.format(x)).to_numpy(),
        index=df.person
    ).reset_index()
    

    Output:

       person   simSara  simJordan  simAmish  simKimmie
    0    Sara  1.000000   0.872872  0.997176   0.966988
    1  Jordan  0.872872   1.000000  0.860946   0.738549
    2   Amish  0.997176   0.860946  1.000000   0.978232
    3  Kimmie  0.966988   0.738549  0.978232   1.000000
    

    If you want everything in the same dataframe, merge the above output with your original df on the person column.
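
The question also asks for a similarity to appear only when the two people are friends, which the steps above do not yet apply. A minimal sketch of that filter plus the merge follows, assuming the friend flags in near_df line up with the sim columns by name and that non-friend cells should become NaN (both are assumptions, not something the original answer specifies). The names people, sim, friends, masked, and result are illustrative.

```python
import numpy as np
import pandas as pd

preference = {
    'person': ["Sara", "Jordan", "Amish", "Kimmie"],
    'game_night': [30, 10, 50, 30],
    'movies': [10, 10, 20, 10],
    'dinner_out': [20, 20, 30, 10],
}
near = {
    'person': ["Sara", "Jordan", "Amish", "Kimmie"],
    'friendSara': [0, 1, 0, 0],
    'friendJordan': [1, 0, 1, 1],
    'friendAmish': [0, 1, 0, 1],
    'friendKimmie': [0, 1, 1, 0],
}
df = pd.DataFrame(data=preference)
near_df = pd.DataFrame(data=near)

people = df['person'].tolist()

# Pairwise cosine similarity (unit-normalised rows, then a matrix product).
X = df[['game_night', 'movies', 'dinner_out']].to_numpy(dtype=float)
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = pd.DataFrame(unit @ unit.T, index=people,
                   columns=[f'sim{p}' for p in people])

# Boolean friend flags, ordered to match the rows and columns of sim.
friends = (
    near_df.set_index('person')
    .loc[people, [f'friend{p}' for p in people]]
    .astype(bool)
    .to_numpy()
)

# Keep a similarity only where the pair is flagged as friends (assumed: NaN elsewhere).
masked = sim.where(friends)

# Attach the masked similarities to the original preferences.
result = df.merge(masked.rename_axis('person').reset_index(), on='person')
print(result)
```

With this data, the simSara column holds a value only in Jordan's row, since Jordan is the only person flagged in friendSara.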