python pandas dictionary lambda cosine-similarity

Creating a new column from a function of two vectors taken from each row of a DataFrame in Python


I have a DataFrame and would like to create a new column by applying a function to two vectors taken from each row.

For example, I have data that looks like this:

import pandas as pd

df = pd.DataFrame({
    "A": [1,2,3,4,5],
    "B": [6,7,8,9,10],
    "C": [7,8,1,9,10],
    "D": [2,3,4,5,6],
})

I want to calculate the cosine similarity of the vectors [A, B] and [C, D] row by row and output the result as a new column E.

The function I have is as follows:

import sklearn as sk
from sklearn.metrics import pairwise as pw

def similarity(Vec1, Vec2):
    return pw.cosine_similarity(Vec1,Vec2)

I am trying to use map with a lambda and currently have the following. The problem is that it computes the cosine similarity down the columns rather than across each row. Ideally I would select the columns by position, so that I can choose the fields I need and the approach still works if the number of fields gets very large.

df['E'] = map(lambda x,y : similarity(x,y), df.iloc[:,2:], df.iloc[:,:2])
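
A likely cause, assuming standard pandas iteration behaviour: iterating over a DataFrame yields its column labels rather than its rows, so map never sees row vectors. A small check on the same data (the names column_pairs and row_pairs are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [7, 8, 1, 9, 10],
    "D": [2, 3, 4, 5, 6],
})

# Iterating over a DataFrame walks its column labels, not its rows,
# so zip() (like map()) pairs up columns: "C" with "A", "D" with "B".
column_pairs = list(zip(df.iloc[:, 2:], df.iloc[:, :2]))
print(column_pairs)

# Converting to a NumPy array first makes the iteration go row by row.
row_pairs = list(zip(df.iloc[:, :2].to_numpy(), df.iloc[:, 2:].to_numpy()))
print(row_pairs[0])
```

So any row-wise approach needs to iterate over the underlying array (or use apply with axis=1) rather than over the DataFrame itself.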

Solution

  • This is one way:

    import numpy as np
    import pandas as pd
    from sklearn.metrics import pairwise as pw
    
    df = pd.DataFrame({
        "A": [1,2,3,4,5],
        "B": [6,7,8,9,10],
        "C": [7,8,1,9,10],
        "D": [2,3,4,5,6],
    })
    
    # cosine_similarity expects 2-D input, so each vector is wrapped
    # as a single-row array of shape (1, 2); [0][0] extracts the scalar.
    df['E'] = df.apply(lambda row: pw.cosine_similarity(
                           np.array([[row['A'], row['B']]]),
                           np.array([[row['C'], row['D']]]))[0][0], axis=1)
    
    #    A   B   C  D         E
    # 0  1   6   7  2  0.429057
    # 1  2   7   8  3  0.594843
    # 2  3   8   1  4  0.993533
    # 3  4   9   9  5  0.798815
    # 4  5  10  10  6  0.843661
    

    A more easily extensible solution, selecting the columns by position:

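    # [i] and [j] turn each 1-D row into the 2-D input cosine_similarity expects.
    # 2:4 (rather than 2:) keeps an already existing column E out of the second vector.
    df['E'] = [pw.cosine_similarity([i], [j])[0][0]
               for i, j in zip(df.iloc[:, :2].values, df.iloc[:, 2:4].values)]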
    

    Functional alternative:

    # Same idea with map: each row is wrapped in a list to make it 2-D.
    df['E'] = list(map(lambda i, j: pw.cosine_similarity([i], [j])[0][0],
                       df.iloc[:, :2].values,
                       df.iloc[:, 2:4].values))
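
If the number of rows or fields gets large, calling cosine_similarity once per row can become slow. A minimal vectorised sketch, assuming plain NumPy in place of scikit-learn and assuming no row vector is all zeros (that would divide by zero); left and right are illustrative names, not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [7, 8, 1, 9, 10],
    "D": [2, 3, 4, 5, 6],
})

# Select the two vectors by column position; each array has shape (n_rows, 2).
left = df.iloc[:, :2].to_numpy(dtype=float)
right = df.iloc[:, 2:4].to_numpy(dtype=float)

# Row-wise dot product divided by the product of the row-wise Euclidean norms.
dot = (left * right).sum(axis=1)
norms = np.linalg.norm(left, axis=1) * np.linalg.norm(right, axis=1)
df["E"] = dot / norms

print(df)
```

Widening the slices (for example :10 and 10:20) extends this to longer vectors without changing the rest of the code, as long as both slices have the same width.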