python pandas cosine-similarity

Pandas: Apply function over each pair of columns under constraints


As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form:

Code |  14  |  17  |  19  | ...
w1   |  0   |   5  |   3  | ...
w2   |  2   |   5  |   4  | ... 
w3   |  0   |   0  |   5  | ...

The Code corresponds to a specific location in a rectangular grid, and the w rows are different words. I would like to compute the cosine similarity between each pair of columns, but only (edited) if the sum of the items in one of the columns of the pair is greater than 5.

The desired output would be something like:

     | [14,17]  |  [14,19]  |  [14,...]  |  [17,19]  | ...
Sim  |cs(14,17) |cs(14,19)  |cs(14,...)  |cs(17,19)..| ...

cs is the result of the cosine similarity for each pair of columns. Is there any suitable method to do this?

Any help would be appreciated :-)


Solution

  • To apply the cosine metric to each pair drawn from two collections of inputs, you can use scipy.spatial.distance.cdist. This is much faster than a double Python loop.

    Let one collection be all the columns of df. Let the other collection be only those columns where the sum is greater than 5:

    import pandas as pd
    df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
    mask = df.sum(axis=0) > 5
    df2 = df.loc[:, mask]
    

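    Then all the pairwise values can be computed with one call to cdist. Note that metric='cosine' returns the cosine distance, which is 1 minus the cosine similarity, so subtract the result from 1 if you need the similarity itself: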

    import scipy.spatial.distance as SSD
    values = SSD.cdist(df2.T, df.T, metric='cosine')
    # array([[  2.92893219e-01,   1.11022302e-16,   3.00000000e-01],
    #        [  4.34314575e-01,   3.00000000e-01,   1.11022302e-16]])
    

    The values can be wrapped in a new DataFrame and reshaped:

    result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
    result = result.stack()
    

    Putting the steps together, and dropping the pairs that compare a column with itself:

    import pandas as pd
    import scipy.spatial.distance as SSD
    df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
    mask = df.sum(axis=0) > 5
    df2 = df.loc[:, mask]
    values = SSD.cdist(df2.T, df.T, metric='cosine')
    result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
    result = result.stack()
    mask = result.index.get_level_values(0) != result.index.get_level_values(1)
    result = result.loc[mask]
    print(result)
    

    yields the following Series of cosine distances, indexed by column pair:

    17  14    0.292893
        19    0.300000
    19  14    0.434315
        17    0.300000
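
The question asks for similarities, with each unordered pair listed once, so here is a minimal sketch of one way to get there. It swaps cdist for scipy.spatial.distance.pdist, which computes each unordered pair only once, and it assumes the condition means "at least one column of the pair sums to more than 5". The names `pairs`, `big`, `keep` and `similarity` are just illustrative.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})

# pdist treats each row as one observation, so transpose to compare columns.
# The condensed result is ordered like itertools.combinations over the rows:
# (0, 1), (0, 2), ..., (1, 2), ...
distances = pdist(df.T.to_numpy(dtype=float), metric='cosine')
pairs = list(combinations(df.columns, 2))

# Cosine similarity is 1 minus cosine distance.
similarity = pd.Series(1.0 - distances,
                       index=pd.MultiIndex.from_tuples(pairs))

# Assumption: keep a pair when at least one of its columns sums to more than 5.
big = df.sum(axis=0) > 5
keep = np.array([big[a] or big[b] for a, b in pairs], dtype=bool)
similarity = similarity.loc[keep]

# One row labelled 'Sim', one column per pair, as sketched in the question.
wide = similarity.to_frame('Sim').T
```

With the example data all three pairs survive the filter, because columns 17 and 19 both sum to more than 5. If the condition should instead require both columns to exceed 5, replace `or` with `and`.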