In Python, I have a problem with my work. I tried converting my DataFrames to lists to compute the result, but it doesn't work. My input consists of two pandas DataFrames. For each row of Df1, I would like to find the similarity between its val1, val2 and val3 and the val1, val2 and val3 of the row in Df2 that has the same item1. I will use Pearson correlation as the similarity measure.
Input:
Df1:
item1  item2  val1  val2  val3
1      2      0.1   0.2   0.3
1      3      0.2   0.3   0.5
2      4      0.5   0.6   0.7
3      5      0.7   0.2   0.1
Df2:
item1  val1  val2  val3
1      0.1   0.5   0.7
2      0.2   0.8   0.9
3      0.7   0.6   0.5
Output:
item1 item2 similarity
1 2 0.235
1 3 0.567
2 4 0.414
3 5 0.231
How can I find the similarity from these DataFrames?
I'm not sure this is exactly what you want, because my output differs from yours, but maybe it helps.
Step 1. Create data and merge.
import pandas as pd
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path

df1 = pd.DataFrame(data=[[1, 2, 0.1, 0.2, 0.3],
                         [1, 3, 0.2, 0.3, 0.5],
                         [2, 4, 0.5, 0.5, 0.7],
                         [3, 5, 0.7, 0.2, 0.1]],
                   columns=['item1', 'item2', 'val1', 'val2', 'val3'])
df2 = pd.DataFrame(data=[[1, 0.1, 0.5, 0.7],
                         [2, 0.2, 0.8, 0.9],
                         [3, 0.7, 0.6, 0.5]],
                   columns=['item1', 'val1', 'val2', 'val3'])
# Inner join on item1; overlapping column names get the suffixes _x (df1) and _y (df2)
df = df1.merge(df2, on='item1')
Output:
item1 item2 val1_x val2_x val3_x val1_y val2_y val3_y
0 1 2 0.1 0.2 0.3 0.1 0.5 0.7
1 1 3 0.2 0.3 0.5 0.1 0.5 0.7
2 2 4 0.5 0.5 0.7 0.2 0.8 0.9
3 3 5 0.7 0.2 0.1 0.7 0.6 0.5
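If the default _x/_y suffixes are hard to read, the merge can be made more explicit. The following is only a sketch: the suffix names '_df1' and '_df2' are my own choice, and how='left' with validate='many_to_one' assumes that you want to keep every row of df1 and that item1 is unique in df2.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[1, 2, 0.1, 0.2, 0.3],
     [1, 3, 0.2, 0.3, 0.5],
     [2, 4, 0.5, 0.5, 0.7],
     [3, 5, 0.7, 0.2, 0.1]],
    columns=['item1', 'item2', 'val1', 'val2', 'val3'])
df2 = pd.DataFrame(
    [[1, 0.1, 0.5, 0.7],
     [2, 0.2, 0.8, 0.9],
     [3, 0.7, 0.6, 0.5]],
    columns=['item1', 'val1', 'val2', 'val3'])

# how='left' keeps every row of df1 even if item1 has no match in df2;
# validate='many_to_one' raises an error if item1 is duplicated in df2.
merged = df1.merge(df2, on='item1', how='left',
                   suffixes=('_df1', '_df2'), validate='many_to_one')
print(merged.columns.tolist())
```

With a left join, rows of df1 without a match in df2 would get NaN in the df2 columns, so they would need to be handled before computing a correlation.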
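Step 2. Define a function to calculate the correlation.
def corr(df):
    # Only the first row of each group is used, as two 3-element vectors.
    x = df[['val1_x', 'val2_x', 'val3_x']].to_numpy()[0]  # as_matrix() was removed from pandas
    y = df[['val1_y', 'val2_y', 'val3_y']].to_numpy()[0]
    # pearsonr returns (coefficient, p-value); keep only the coefficient.
    return pd.DataFrame(data=[pearsonr(x, y)[0]], columns=['similarity'])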
Step 3. Group by the items and apply the corr function.
df = df.groupby(['item1', 'item2']).apply(corr).reset_index().drop(columns='level_2')
Output:
item1 item2 similarity
0 1 2 0.981981
1 1 3 0.928571
2 2 4 0.609994
3 3 5 0.933257
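As an alternative to groupby/apply, the same row-wise Pearson correlation can be computed directly with NumPy. This is only a sketch under the assumption that each merged row should be compared on its own; the names merged, x, y, xc, yc and result are mine, not from the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    [[1, 2, 0.1, 0.2, 0.3],
     [1, 3, 0.2, 0.3, 0.5],
     [2, 4, 0.5, 0.5, 0.7],
     [3, 5, 0.7, 0.2, 0.1]],
    columns=['item1', 'item2', 'val1', 'val2', 'val3'])
df2 = pd.DataFrame(
    [[1, 0.1, 0.5, 0.7],
     [2, 0.2, 0.8, 0.9],
     [3, 0.7, 0.6, 0.5]],
    columns=['item1', 'val1', 'val2', 'val3'])
merged = df1.merge(df2, on='item1')

# One row per (item1, item2) pair, three values per row
x = merged[['val1_x', 'val2_x', 'val3_x']].to_numpy()
y = merged[['val1_y', 'val2_y', 'val3_y']].to_numpy()

# Center each row, then apply the Pearson formula row by row
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
r = (xc * yc).sum(axis=1) / np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))

result = merged[['item1', 'item2']].assign(similarity=r)
print(result)
```

This should give the same similarities as the groupby version above, and it does not silently drop rows if the same (item1, item2) pair appears more than once. A row whose three values are all equal has zero variance, so its correlation is undefined and comes out as NaN.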