pandas, matrix, filtering, similarity

Finding Similarities Between houses in pandas dataframe for content filtering


I want to apply content-based filtering to houses. I would like to compute a similarity score for each house so that I can make recommendations. For example, what can I recommend for house1? So I need a similarity matrix for the houses. How can I compute it?

Thank you

    import pandas as pd

    data = [['house1',100,1500,'gas','3+1']
    ,['house2',120,2000,'gas','2+1']
    ,['house3',40,1600,'electricity','1+1']
    ,['house4',110,1450,'electricity','2+1']
    ,['house5',140,1200,'electricity','2+1']
    ,['house6',90,1000,'gas','3+1']
    ,['house7',110,1475,'gas','3+1']
    ]

    # Create the pandas DataFrame
    df = pd.DataFrame(data, columns =
    ['house','size','price','heating_type','room_count'])

Solution

  • Suppose we measure dissimilarity as the absolute difference for numeric values and as 1 - ratio for strings, where ratio is the similarity ratio calculated by SequenceMatcher (taking 1 - ratio makes it comparable to a difference). We can then apply these operations to the respective columns, divide each column by its maximum and subtract the result from 1, which gives values in the range 0 ... 1, where 1 means the same value as the reference house and 0 means minimum similarity. The total column is built from the sum of the raw per-column differences, so columns with large values (here price) dominate it. The most similar house is the one with the maximum total similarity rating.
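To make the string part concrete, here is a minimal sketch of the 1 - ratio distance for a few values from the question's data. The helper name `string_distance` is made up for this illustration and is not part of `difflib`.

```python
from difflib import SequenceMatcher


def string_distance(a, b):
    """Return 1 - SequenceMatcher ratio: 0.0 for identical strings, 1.0 for no overlap."""
    return 1 - SequenceMatcher(None, a, b).ratio()


print(string_distance('3+1', '3+1'))          # identical strings
print(string_distance('3+1', '2+1'))          # only the '+1' part matches
print(string_distance('gas', 'electricity'))  # no characters in common
```

Since 'gas' and 'electricity' share no characters, their distance is the largest possible, and '2+1' and '1+1' are equally far from '3+1'. This is why, after scaling, the string columns in the result only take the values 0 and 1 for this data.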

    from difflib import SequenceMatcher
    
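    df = df.set_index('house')
    ref = df.loc['house1']  # the house to compare all others against
    
    # Numeric columns: absolute difference from the reference house
    res = df[['size','price']].sub(ref[['size','price']].astype(float)).abs()
    # String columns: 1 - ratio, so that 0 means identical (comparable to a difference)
    res['heating_type'] = df.heating_type.apply(lambda x: 1 - SequenceMatcher(None, ref['heating_type'], x).ratio())
    res['room_count'] = df.room_count.apply(lambda x: 1 - SequenceMatcher(None, ref['room_count'], x).ratio())
    res['total'] = res['size'] + res.price + res.heating_type + res.room_count
    # Scale each column by its maximum and invert: 1 = same as house1, 0 = least similar
    res = 1 - res / res.max()
    
    print(res)
    # Exclude house1 itself before looking for the best match
    print('\nBest match of house1 is ' + res.total.drop('house1').idxmax())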
    

    Result:

                size  price  heating_type  room_count     total
    house                                                      
    house1  1.000000   1.00           1.0         1.0  1.000000
    house2  0.666667   0.00           1.0         0.0  0.000000
    house3  0.000000   0.80           0.0         0.0  0.689942
    house4  0.833333   0.90           0.0         0.0  0.882127
    house5  0.333333   0.40           0.0         0.0  0.344010
    house6  0.833333   0.00           1.0         1.0  0.019859
    house7  0.833333   0.95           1.0         1.0  0.932735
    
    Best match of house1 is house7
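
The question asks for a similarity matrix over all houses, while the code above compares everything against house1 only. The following is a minimal sketch of one way to extend the same idea to every pair. It is a variation, not the answer's exact method: each column's distances are scaled to 0 ... 1 before they are combined, and the columns are averaged with equal weight, which is an assumption. The helper names `pairwise_numeric` and `pairwise_string` are made up for this illustration.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd

data = [['house1', 100, 1500, 'gas', '3+1'],
        ['house2', 120, 2000, 'gas', '2+1'],
        ['house3', 40, 1600, 'electricity', '1+1'],
        ['house4', 110, 1450, 'electricity', '2+1'],
        ['house5', 140, 1200, 'electricity', '2+1'],
        ['house6', 90, 1000, 'gas', '3+1'],
        ['house7', 110, 1475, 'gas', '3+1']]
columns = ['house', 'size', 'price', 'heating_type', 'room_count']
df = pd.DataFrame(data, columns=columns).set_index('house')


def pairwise_numeric(col):
    """Pairwise absolute differences, scaled to 0..1 by the largest difference."""
    values = col.to_numpy(dtype=float)
    dist = np.abs(values[:, None] - values[None, :])
    return dist / dist.max() if dist.max() > 0 else dist


def pairwise_string(col):
    """Pairwise 1 - SequenceMatcher ratio, scaled to 0..1 by the largest distance."""
    values = col.tolist()
    dist = np.array([[1 - SequenceMatcher(None, a, b).ratio() for b in values]
                     for a in values])
    return dist / dist.max() if dist.max() > 0 else dist


# Equal weight for every column (an assumption; adjust the weighting as needed).
distances = [pairwise_numeric(df['size']), pairwise_numeric(df['price']),
             pairwise_string(df['heating_type']), pairwise_string(df['room_count'])]
similarity = pd.DataFrame(1 - np.mean(distances, axis=0),
                          index=df.index, columns=df.index)
print(similarity.round(3))

# Hide the diagonal so that a house is never recommended for itself.
masked = similarity.mask(np.eye(len(similarity), dtype=bool))
best_match = masked.idxmax(axis=1)
print(best_match)
```

With this data the best match for house1 is still house7. Because every column is scaled before averaging, price no longer dominates the score, so the ranking of the remaining houses can differ from the `total` column above.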