I want to apply content-based filtering to houses. I would like to compute a similarity score for each house so that I can make recommendations. For example, what should I recommend for house 1? I think I need a similarity matrix for the houses. How can I compute it?
Thank you
import pandas as pd

data = [
    ['house1', 100, 1500, 'gas', '3+1'],
    ['house2', 120, 2000, 'gas', '2+1'],
    ['house3', 40, 1600, 'electricity', '1+1'],
    ['house4', 110, 1450, 'electricity', '2+1'],
    ['house5', 140, 1200, 'electricity', '2+1'],
    ['house6', 90, 1000, 'gas', '3+1'],
    ['house7', 110, 1475, 'gas', '3+1'],
]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['house', 'size', 'price', 'heating_type', 'room_count'])
If we measure dissimilarity as the absolute difference for numeric values and as 1 - ratio for strings, where ratio is the similarity ratio calculated by SequenceMatcher (subtracting it from 1 makes it comparable to a difference), we can apply these operations to the respective columns relative to house1. Each column is then divided by its maximum and subtracted from 1, which maps it to the range 0 ... 1, where 1 means (almost) equal to house1 and 0 means least similar. The total column is the sum of the individual differences, normalized the same way, so the most similar house is the one with the maximum total rating. Note that the total is summed before normalizing, so the column with the largest raw values (price here) dominates it.
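To make the string part concrete, here is a small sketch of what 1 - ratio gives for values from the example data. The helper name `string_distance` is made up for illustration and is not part of difflib.

```python
from difflib import SequenceMatcher


def string_distance(a: str, b: str) -> float:
    """Return 1 - SequenceMatcher ratio: 0.0 for identical strings, 1.0 for nothing in common."""
    return 1 - SequenceMatcher(None, a, b).ratio()


print(string_distance('gas', 'gas'))          # identical strings -> 0.0
print(string_distance('gas', 'electricity'))  # no characters in common -> 1.0
print(string_distance('3+1', '2+1'))          # shares '+1' -> about 0.33
```

Keep in mind that this compares characters, not meaning. For a purely categorical column such as heating_type, a plain equal / not equal check may be just as appropriate.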
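from difflib import SequenceMatcher

df = df.set_index('house')
ref = df.loc['house1']  # reference house that all others are compared against

# Absolute differences for the numeric columns
res = df[['size', 'price']].sub(ref[['size', 'price']].astype(float)).abs()
# 1 - SequenceMatcher ratio for the string columns (0 means identical)
res['heating_type'] = df.heating_type.apply(lambda x: 1 - SequenceMatcher(None, ref['heating_type'], x).ratio())
res['room_count'] = df.room_count.apply(lambda x: 1 - SequenceMatcher(None, ref['room_count'], x).ratio())
res['total'] = res['size'] + res.price + res.heating_type + res.room_count
# Scale each column by its maximum and invert: 1 = same as house1, 0 = least similar
res = 1 - res / res.max()
print(res)
# Exclude house1 itself before picking the best match
print('\nBest match of house1 is ' + res.total.drop('house1').idxmax())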
Result:
            size  price  heating_type  room_count     total
house
house1  1.000000   1.00           1.0         1.0  1.000000
house2  0.666667   0.00           1.0         0.0  0.000000
house3  0.000000   0.80           0.0         0.0  0.689942
house4  0.833333   0.90           0.0         0.0  0.882127
house5  0.333333   0.40           0.0         0.0  0.344010
house6  0.833333   0.00           1.0         1.0  0.019859
house7  0.833333   0.95           1.0         1.0  0.932735
Best match of house1 is house7
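The question asks for a similarity matrix over all houses, not only a comparison against house1. One possible way to extend the idea is sketched below. It differs from the code above in two ways, both of which are my assumptions rather than the only sensible choice: each column is normalized before the columns are combined, and the columns are averaged with equal weight. The helper names `numeric_similarity` and `string_similarity` are made up for this example.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd

data = [
    ['house1', 100, 1500, 'gas', '3+1'],
    ['house2', 120, 2000, 'gas', '2+1'],
    ['house3', 40, 1600, 'electricity', '1+1'],
    ['house4', 110, 1450, 'electricity', '2+1'],
    ['house5', 140, 1200, 'electricity', '2+1'],
    ['house6', 90, 1000, 'gas', '3+1'],
    ['house7', 110, 1475, 'gas', '3+1'],
]
df = pd.DataFrame(
    data, columns=['house', 'size', 'price', 'heating_type', 'room_count']
).set_index('house')


def numeric_similarity(col: pd.Series) -> pd.DataFrame:
    """Pairwise 1 - |a - b| / (largest difference) for a numeric column.

    Assumes the column is not constant, otherwise the largest difference is 0.
    """
    values = col.to_numpy(dtype=float)
    diff = np.abs(values[:, None] - values[None, :])
    return pd.DataFrame(1 - diff / diff.max(), index=col.index, columns=col.index)


def string_similarity(col: pd.Series) -> pd.DataFrame:
    """Pairwise SequenceMatcher ratio for a string column."""
    sim = [[SequenceMatcher(None, a, b).ratio() for b in col] for a in col]
    return pd.DataFrame(sim, index=col.index, columns=col.index)


# Assumption: every column gets equal weight; the mean keeps scores in 0..1
similarity = (
    numeric_similarity(df['size'])
    + numeric_similarity(df['price'])
    + string_similarity(df['heating_type'])
    + string_similarity(df['room_count'])
) / 4

print(similarity.round(2))

# Best match per house, ignoring each house's similarity to itself
off_diagonal = similarity.mask(np.eye(len(similarity), dtype=bool))
print(off_diagonal.idxmax(axis=1))
```

With equal weights, house7 is still the best match for house1 (score about 0.97). If some features matter more than others, replace the plain average with a weighted one.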