Vector similarity with multiple dtypes (string, int, floats etc.)?


I have the following 2 rows in my dataframe:

[1, 1.1, -19, "kuku", "lulu"]
[2.8, 1.1, -20, "kuku", "lilu"]

I want to calculate their similarity by comparing each dimension (1 if equal, otherwise 0), which gives the following vector: [0, 1, 0, 1, 0]. Is there a function that takes a vector, performs this "similarity" against all rows, and calculates the mean? In our case it would be 2/5 = 0.4.


Solution

  • I would just use a simple == on NumPy arrays, cast the result to int to get the vector, and use numpy.mean() for the mean of that vector:

    import numpy as np
    
    
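    a = [1, 1.1, -19, "kuku", "lulu"]
    b = [2.8, 1.1, -20, "kuku", "lilu"]
    
    
    # dtype=object keeps each value's own type; without it NumPy converts the
    # mixed values to strings, so e.g. 1 and 1.0 would no longer compare equal
    res = (np.array(a, dtype=object) == np.array(b, dtype=object)).astype(int)
    print(res)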
    # [0 1 0 1 0]
    v = res.mean()
    print(v)
    # 0.4
    

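    To score every row against every other row, you can broadcast the comparison. This evaluates each pair twice (the result is symmetric) and builds an intermediate boolean array of shape (n, n, d) for n rows and d columns, so use it only if you do not mind the redundant work and can afford that potentially large temporary: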

    import numpy as np
    
    
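    # dtype=object keeps the mixed values as Python objects instead of strings
    arr = np.array([
        [1, 1.1, -19, "kuku", "lulu"],
        [2.8, 1.1, -20, "kuku", "lilu"],
        [2.8, 1.1, -20, "kuku", "lulu"]], dtype=object)
    
    
    # Broadcast to shape (n, n, d): corr[i, j, k] is True where arr[i, k] == arr[j, k]
    corr = arr[None, :, :] == arr[:, None, :]
    # Average over the last axis to get one score per pair of rows
    score = corr.mean(-1)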
    print(score)
    # [[1.  0.4 0.6]
    #  [0.4 1.  0.8]
    #  [0.6 0.8 1. ]]
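
The question also asks for a function that compares one vector against all rows. A minimal sketch of that, using the same data: `match_fraction` is a hypothetical helper name, not a NumPy function, and the arrays are assumed to be built with `dtype=object` so that ints, floats and strings keep their own types. Applying the helper one row at a time also gives the pairwise matrix without the (n, n, d) temporary, although each pair is still evaluated twice.

```python
import numpy as np


def match_fraction(query, rows):
    """Fraction of positions where each row of `rows` equals `query`.

    Hypothetical helper for illustration; not part of NumPy.
    """
    # Broadcasting compares `query` with every row: shape (n_rows, n_cols).
    return (rows == query).mean(axis=1)


# Assumption: dtype=object, so NumPy keeps the values as Python objects
# instead of converting the mixed list to a common string dtype.
rows = np.array([
    [1, 1.1, -19, "kuku", "lulu"],
    [2.8, 1.1, -20, "kuku", "lilu"],
    [2.8, 1.1, -20, "kuku", "lulu"]], dtype=object)

# One vector against all rows.
query = np.array([1, 1.1, -19, "kuku", "lulu"], dtype=object)
scores = match_fraction(query, rows)
print(scores)

# All pairs, one row at a time: each step needs only an
# (n_rows, n_cols) boolean temporary.
pairwise = np.stack([match_fraction(row, rows) for row in rows])
print(pairwise)
```

Here `scores` holds 1.0, 0.4 and 0.6 for the three rows, and `pairwise` matches the score matrix above.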