pandas, numpy, array-broadcasting

The difference in results of numpy filtering does not make sense?


I have a sample dataframe which I uploaded to my GitHub Gist (it has 98 rows; the original data has millions). It has 4 numerical columns, 1 ID column, and 1 column that indicates the cluster ID. I have written a function which I apply to that dataframe in two ways:

  • Case A. I group by individual and apply the function.
  • Case B. I group by both individual and cluster and apply the function.

Here is the function in question:

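import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def vectorized_similarity_filtering2(df, cols = ["scaledPrice", "scaledAirlines", "scaledFlights", "scaledTrip"]):
    arr = df[cols].to_numpy()
    b = arr[..., None]    # shape (n, d, 1)
    c = arr.T[None, ...]  # shape (1, d, n)
    # mask[i, j] is True when row i is <= row j in every column and < in at least one
    mask = (((b <= c).all(axis=1)) & ((b < c).any(axis=1)))
    # symmetrize, so the pair is excluded in both directions
    mask |= mask.T
    # excluded pairs become NaN, which never satisfies >= 0.6
    sims = np.where(mask, np.nan, cosine_similarity(arr))
    return np.sum(sims >= 0.6, axis = 1)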

What it does, in a few steps:

  1. It compares the current row to all the other rows.
  2. It filters out every row that is greater than or equal to the current row in all dimensions and strictly greater in at least one. Because the mask is symmetrized, rows for which the reverse holds are filtered out as well.
  3. For the remaining rows, it calculates the cosine similarity between them and the current row.
  4. It counts, for each row, the number of elements in the similarity matrix which are greater than or equal to 0.6 and returns the result.
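To make step 2 concrete, here is a minimal sketch of the broadcasting mask on a made-up 4x2 array (the values are purely illustrative and not taken from the Gist). It only reproduces the mask, not the cosine similarity step.

```python
import numpy as np

# Toy data, invented for illustration: 4 rows, 2 columns.
arr = np.array([[1.0, 1.0],
                [2.0, 2.0],
                [2.0, 1.0],
                [1.0, 3.0]])

b = arr[..., None]      # shape (4, 2, 1): row i along axis 0
c = arr.T[None, ...]    # shape (1, 2, 4): row j along axis 2

# dominated[i, j]: row i <= row j in every column and < in at least one
dominated = (b <= c).all(axis=1) & (b < c).any(axis=1)

# Symmetric version, as in the function: the pair is dropped both ways.
mask = dominated | dominated.T
print(mask)
```

Row 0 is dominated by every other row, so it is masked against all of them. Rows 1 and 3 are incomparable (each is larger in one column), so that pair survives. The diagonal is never masked, because a row is never strictly smaller than itself, so each row still gets compared with itself in the similarity step.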

By this logic, each element of the result for Case A (applied per individual) must be no less than the corresponding element of the result for Case B (applied per individual and cluster), because in Case B each row is compared against fewer rows. However, I see that Case B has larger counts than Case A for some rows. That does not make sense to me. I hope somebody can explain what is wrong with the code, or with my understanding.
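The expectation itself can be checked in isolation. The sketch below uses a NumPy-only stand-in for the function (the cosine similarity is computed by hand instead of with sklearn, and `similarity_counts` is a hypothetical name) on random data, with every other row playing the role of a "cluster". Since both the mask and the similarity depend only on the pair of rows involved, the count on a subset should never exceed the count on the full group, as long as the same rows are compared.

```python
import numpy as np

def similarity_counts(arr, threshold=0.6):
    """NumPy-only stand-in for the function above (hypothetical helper)."""
    b = arr[..., None]
    c = arr.T[None, ...]
    mask = (b <= c).all(axis=1) & (b < c).any(axis=1)
    mask = mask | mask.T
    # Cosine similarity by hand: normalize rows, then take dot products.
    unit = arr / np.linalg.norm(arr, axis=1, keepdims=True)
    sims = np.where(mask, np.nan, unit @ unit.T)
    return np.sum(sims >= threshold, axis=1)

rng = np.random.default_rng(0)
arr = rng.random((20, 4)) + 0.1   # strictly positive, so no zero-norm rows
subset = np.arange(0, 20, 2)      # assumed "cluster": every other row

full_counts = similarity_counts(arr)
subset_counts = similarity_counts(arr[subset])

# Compare each subset row with the SAME row in the full result.
print(np.all(full_counts[subset] >= subset_counts))
```

The key detail is the indexing `full_counts[subset]`: the comparison is only meaningful when position k in both arrays refers to the same row.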

Here are steps to replicate the results:

# df being the dataframe
g = df.groupby("individual")
gc = df.groupby(["individual", "cluster"])


caseA = np.concatenate(g.apply(lambda x: vectorized_similarity_filtering2(x)).values)
caseB = np.concatenate(gc.apply(lambda x: vectorized_similarity_filtering2(x)).values)

caseA >= caseB
array([ True,  True,  True,  True,  True,  True,  True, False, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True])

EDIT: formatting


Solution

  • The culprit is the order in which the cluster groupby iterates. It currently loops through the clusters in this order: [0, 2, 1, 5, 3, 4, 11, 6, 7, 12, 8, 9, 10]. This means that the elements aren't aligned in the comparison caseA >= caseB, so you are comparing the results of different rows to each other.

    One solution is to sort your dataframe first, so that your function on the cluster groupby returns values in the same order as on the individual groupby, like this:

    df = df.sort_values(by=['cluster'])
    

    Then it should work!
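To see the misalignment without the similarity function, here is a small sketch on a made-up dataframe (the column `x` and all values are invented; `rows_in_group_order` is a hypothetical helper). It records which original row ends up at each position when per-group results are concatenated in groupby order.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration; cluster ids are not sorted.
df = pd.DataFrame({
    "individual": ["a", "a", "a", "b", "b"],
    "cluster":    [1,   0,   1,   0,   0],
    "x":          [10,  20,  30,  40,  50],
})

def rows_in_group_order(frame, keys):
    """Original row labels in the order a concatenated groupby result has."""
    return np.concatenate(
        [sub.index.to_numpy() for _, sub in frame.groupby(keys)]
    )

order_a = rows_in_group_order(df, "individual")
order_b = rows_in_group_order(df, ["individual", "cluster"])
print(order_a)  # rows stay in their original order within each individual
print(order_b)  # rows are regrouped by cluster within each individual

# After sorting by cluster, both groupbys visit the rows in the same order.
df_sorted = df.sort_values(by=["cluster"], kind="mergesort")
sorted_a = rows_in_group_order(df_sorted, "individual")
sorted_b = rows_in_group_order(df_sorted, ["individual", "cluster"])
print(np.array_equal(sorted_a, sorted_b))
```

An alternative that does not depend on row order at all would be to return a Series indexed by each group's original index and compare the two results after aligning on that index; that variant is not shown here.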