Search code examples
pythonpandasdataframepandas-apply

Vectorized dataframe filtering with complex logic


I have a very big dataframe with five columns, ID and four numerical. Let's say, integers between 0 and 50. My goal is to calculate cosine similarity matrix for every ID.

However, I want to force some of that matrix values to np.nan. Namely, if within the given ID a row (lets say 1) is not worse than the other row (lets say 2) in all four dimensions while being better in at least one, then the similarity matrix at coordinates [1, 2] and [2, 1] must be set to np.nan.

Sample dataframe:

a   b   c   d   ID
9   9   7   3   1
9   8   3   2   1
6   5   5   6   1
8   4   7   5   1
4   8   7   2   1
4   6   9   5   1
7   4   3   1   1
5   3   5   2   1
8   9   3   9   1
8   2   7   9   1
6   4   1   1   2
3   9   9   3   2
7   6   7   7   2
7   4   9   3   2
2   5   9   2   2
7   6   2   3   2
5   8   7   5   2
6   9   4   1   3
1   6   8   6   3
1   9   7   6   3
2   8   5   4   3
7   2   5   1   3
9   6   5   3   3
8   2   3   2   3
1   8   2   9   3
1   8   1   6   3
3   6   2   4   3
4   2   9   7   3
9   2   6   8   3
1   2   6   3   3

I came up with a function which outputs indices of rows which happen to follow the rule above:

def filter_se(row, df, cols = None):
    if cols:
        df = df[cols]
        row = row[cols]
    #filter for all rows where none of the columns are worse
    df = df[(row >= df).all(axis = 1)]
    #filter for rows where any column is better.
    df = df[(row > df).any(axis = 1)]
    indexes = df.index.tolist()
    return indexes

sample_group = sample.groupby("ID")
r = {}
for index, group in sample_group:
    res = group.apply(filter_se, args = [group, ["a", "b", "c", "d"]], axis = 1).tolist()
    
    r[index] = res

Output: r
{1: [[1, 4, 6, 7], [6], [7], [6, 7], [], [], [], [], [6], []],
 2: [[], [14], [10, 15], [10], [], [10], []],
 3: [[],
  [29],
  [25, 29],
  [],
  [],
  [21, 23],
  [],
  [25],
  [],
  [],
  [29],
  [21, 23, 29],
  []]}

Similarity matrix for each ID I calculated this way:

from sklearn.metrics.pairwise import cosine_similarity

sims = sample.groupby("ID").apply(lambda g: cosine_similarity(g[["a", "b", "c", "d"]]))

This gives me a pd.Series object with similarity matrix being values and IDs being index. The next step is to set the elements of that matrix to np.nan which I am struggling with. I would like to ask for help with this.


Solution

  • Here is a possible solution for a group with a given ID. In the general setting, it can be applied to each ID group. df is the dataframe with the data.

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    
    # select the group with a specified ID and convert it into a numpy array
    ID = 1
    arr = df[df["ID"] == ID][["a", "b", "c", "d"]].to_numpy()
    
    # compute the similarity matrix for this group
    b = arr[..., None]
    c = arr.T[None, ...]
    mask = (((b >= c).all(axis=1)) & ((b > c).any(axis=1)))
    mask |= mask.T
    sims = np.where(mask, np.nan, cosine_similarity(arr))