
Compute Jaccard similarity on a dataframe


I'm a self-taught Python learner trying to improve, so any help is very welcome. Thanks a lot! I want to compute a Jaccard similarity over one column of my dataframe, grouping the rows by the value in another column. The df looks like this:

name       bag number       item          quantity
sally         1             BANANA            3
sally         2             BREAD             1
franck        3             BANANA            2
franck        3             ORANGE            1
franck        3             BREAD             4
robert        4             ORANGE            3
jenny         5             BANANA            4
jenny         5             ORANGE            2

There are about 80 item categories. Each bag number (sample) belongs to exactly one shopper, but a shopper can have more than one bag, and quantities range from 0 to 4. I would like to iterate over the bag numbers and compare the contents of each pair of bags with a Jaccard similarity or distance, if possible with the option of using the quantity as a weight in the comparison. The ideal result would be a dataframe like the one in "Python Pandas Distance matrix using jaccard similarity".

I feel that the solution is somewhere between "How to compute jaccard similarity from a pandas dataframe" and "How to apply a custom function to groups in a dask dataframe, using multiple columns as function input".

I am thinking I should iterate with a mask to set up the two arguments of the Jaccard function. But in every example I see, the items to compare are in different columns, so I am kind of lost here. Thanks a lot for helping! Cheers


Solution

  • The easier, unweighted version of the problem can be solved in two steps:

    1. Create a pivot table from your current dataframe, with one row per bag and one column per item

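      # Assumes the bag column is named 'bag_number'; the question's table
      # shows 'bag number', so rename the column or adjust the index argument
      p = df.pivot_table(
          index='bag_number',
          columns='item',
          values='quantity',
      ).fillna(0)  # Items missing from a bag become quantity 0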
      
    2. Follow the example in your linked question to compute the Jaccard distance with scipy, then subtract it from 1 to get the similarity

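      import pandas as pd
      from scipy.spatial.distance import jaccard, pdist, squareform
      
      # astype(bool) keeps only presence/absence of each item; pdist returns
      # condensed pairwise distances and squareform expands them to a matrix
      m = 1 - squareform(pdist(p.astype(bool), jaccard))
      sim = pd.DataFrame(m, index=p.index, columns=p.index)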
      

    Result:

    bag_number         1         2         3         4         5
    bag_number                                                  
    1           1.000000  0.000000  0.333333  0.000000  0.500000
    2           0.000000  1.000000  0.333333  0.000000  0.000000
    3           0.333333  0.333333  1.000000  0.333333  0.666667
    4           0.000000  0.000000  0.333333  1.000000  0.500000
    5           0.500000  0.000000  0.666667  0.500000  1.000000
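
As a cross-check on the matrix above, the same unweighted similarities can be computed directly from Python sets, which is closer to the "iterate over pairs of bags" idea in the question. This is only a minimal sketch: the sample dataframe is rebuilt by hand, the column name `bag_number` is an assumption, and `set_jaccard` is a hypothetical helper, not something from pandas or scipy.

```python
from itertools import combinations

import pandas as pd

# Sample data from the question, typed in by hand
# (the column name 'bag_number' is an assumption)
df = pd.DataFrame({
    "name": ["sally", "sally", "franck", "franck", "franck", "robert", "jenny", "jenny"],
    "bag_number": [1, 2, 3, 3, 3, 4, 5, 5],
    "item": ["BANANA", "BREAD", "BANANA", "ORANGE", "BREAD", "ORANGE", "BANANA", "ORANGE"],
    "quantity": [3, 1, 2, 1, 4, 3, 4, 2],
})

# One set of items per bag; rows with quantity 0 are dropped so that an
# explicit zero counts as "not in the bag"
bags = df[df["quantity"] > 0].groupby("bag_number")["item"].apply(set)


def set_jaccard(a, b):
    """Hypothetical helper: |a & b| / |a | b|, defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Start from 1.0 everywhere (the diagonal), then fill each pair symmetrically
sim_sets = pd.DataFrame(1.0, index=bags.index, columns=bags.index)
for i, j in combinations(bags.index, 2):
    value = set_jaccard(bags.loc[i], bags.loc[j])
    sim_sets.loc[i, j] = value
    sim_sets.loc[j, i] = value
```

For a handful of bags the explicit loop is easy to read and debug; with many bags the `pdist` approach is likely the better choice.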
    

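    The weighted version is only slightly more complicated. The weights that pdist accepts form a single vector applied to every pairwise comparison, so they cannot carry per-bag quantities; you'll need a custom similarity (or distance) function instead. According to Wikipedia, the weighted Jaccard similarity of two non-negative vectors is the sum of their element-wise minima divided by the sum of their element-wise maxima, so the distance can be computed as follows: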

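    import numpy as np
    
    def weighted_jaccard_distance(x, y):
        # Stack the two quantity vectors so min/max can be taken per item
        arr = np.array([x, y])
        # 1 - (sum of element-wise minima / sum of element-wise maxima)
        return 1 - arr.min(axis=0).sum() / arr.max(axis=0).sum()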
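
As a quick sanity check of the formula, take bags 3 and 5 from the sample data. With the items ordered BANANA, BREAD, ORANGE, their quantity vectors are (2, 4, 1) and (4, 0, 2); the element-wise minima sum to 2 + 0 + 1 = 3 and the maxima to 4 + 4 + 2 = 10, so the similarity is 3/10. The sketch below repeats the function so that it runs on its own; the two vectors are typed in by hand from the question's table.

```python
import numpy as np


def weighted_jaccard_distance(x, y):
    arr = np.array([x, y])
    return 1 - arr.min(axis=0).sum() / arr.max(axis=0).sum()


# Quantities in the order BANANA, BREAD, ORANGE (copied from the sample data)
bag3 = [2, 4, 1]
bag5 = [4, 0, 2]

distance = weighted_jaccard_distance(bag3, bag5)
similarity = 1 - distance  # approximately 0.3, up to floating-point rounding
```

Note that two bags whose quantities are all zero would divide zero by zero and give NaN, so it may be worth guarding against that case if empty bags can occur in your data.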
    

    Now you can compute the weighted similarity:

    sim_weighted = pd.DataFrame(
        data=1 - squareform(pdist(p, weighted_jaccard_distance)),
        index=p.index,
        columns=p.index,
    )
    

    Result:

    bag_number     1         2         3         4         5
    bag_number                                              
    1           1.00  0.000000  0.250000  0.000000  0.500000
    2           0.00  1.000000  0.142857  0.000000  0.000000
    3           0.25  0.142857  1.000000  0.111111  0.300000
    4           0.00  0.000000  0.111111  1.000000  0.285714
    5           0.50  0.000000  0.300000  0.285714  1.000000
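
With about 80 items and many bags, calling a Python function for every pair of rows can become slow. One possible alternative, offered as an assumption rather than as part of the approach above, is to compute the same weighted similarity with NumPy broadcasting. In this sketch `weighted_jaccard_matrix` is a hypothetical helper, the sample dataframe is rebuilt by hand, and the column name `bag_number` is assumed.

```python
import numpy as np
import pandas as pd

# Sample data from the question, typed in by hand
# (the column name 'bag_number' is an assumption)
df = pd.DataFrame({
    "bag_number": [1, 2, 3, 3, 3, 4, 5, 5],
    "item": ["BANANA", "BREAD", "BANANA", "ORANGE", "BREAD", "ORANGE", "BANANA", "ORANGE"],
    "quantity": [3, 1, 2, 1, 4, 3, 4, 2],
})

# One row per bag, one column per item, 0 where the item is absent
p = df.pivot_table(index="bag_number", columns="item", values="quantity").fillna(0)


def weighted_jaccard_matrix(values):
    """Hypothetical helper: pairwise sum(min) / sum(max) over the rows of a 2-D array."""
    a = values[:, None, :]  # shape (n_bags, 1, n_items)
    b = values[None, :, :]  # shape (1, n_bags, n_items)
    num = np.minimum(a, b).sum(axis=2)
    den = np.maximum(a, b).sum(axis=2)
    # Where both rows are all zeros the denominator is 0; those entries
    # (including such a row compared with itself) are left at 0.0
    out = np.zeros_like(num, dtype=float)
    np.divide(num, den, out=out, where=den > 0)
    return out


sim_fast = pd.DataFrame(
    weighted_jaccard_matrix(p.to_numpy(dtype=float)),
    index=p.index,
    columns=p.index,
)
```

The intermediate arrays have shape (bags, bags, items), so memory grows quadratically with the number of bags; for very large inputs the `pdist` version or a chunked loop may be the safer option.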