python, pandas, similarity

Find the similarity of values between two DataFrames in Python


I have two pandas DataFrames, Df1 and Df2. I tried converting them to lists to compute the result, but that did not work. For each row of Df1, I would like to compare its val1, val2 and val3 with the val1, val2 and val3 of the row in Df2 that has the same item1. I want to use the Pearson correlation as the similarity measure.

Input:

Df1                                  Df2
 item1 item2  val1 val2 val3          item1 val1 val2 val3
  1      2     0.1  0.2  0.3            1    0.1  0.5  0.7
  1      3     0.2  0.3  0.5            2    0.2  0.8  0.9
  2      4     0.5  0.6  0.7            3    0.7  0.6  0.5
  3      5     0.7  0.2  0.1

Expected output:

 item1 item2  similarity
 1      2       0.235        
 1      3       0.567    
 2      4       0.414         
 3      5       0.231

How can I compute the similarity from these DataFrames?


Solution

  • I'm not sure about this solution, because I get a different output from the one in the question. But maybe it helps.

    Step 1. Create data and merge.

    import pandas as pd
    from scipy.stats import pearsonr
    
    df1 = pd.DataFrame(data=[[1,2,0.1,0.2,0.3],
                             [1,3,0.2,0.3,0.5],
                             [2,4,0.5,0.5,0.7],
                             [3,5,0.7,0.2,0.1]],
                       columns=['item1', 'item2', 'val1', 'val2', 'val3'])
    
    df2 = pd.DataFrame(data=[[1,0.1,0.5,0.7],
                             [2,0.2,0.8,0.9],
                             [3,0.7,0.6,0.5]],
                       columns=['item1', 'val1', 'val2', 'val3'])
    
    df = df1.merge(df2,on='item1')
    

    Output:

       item1  item2  val1_x  val2_x  val3_x  val1_y  val2_y  val3_y
    0      1      2     0.1     0.2     0.3     0.1     0.5     0.7
    1      1      3     0.2     0.3     0.5     0.1     0.5     0.7
    2      2      4     0.5     0.5     0.7     0.2     0.8     0.9
    3      3      5     0.7     0.2     0.1     0.7     0.6     0.5
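
One thing worth checking at this step: `merge` performs an inner join by default, so any row of `df1` whose `item1` has no counterpart in `df2` is silently dropped. The sketch below is only an illustration of how that could be detected. The row with `item1 = 9` is hypothetical and not part of the question's data, and the `_df1` / `_df2` suffixes are an arbitrary naming choice.

```python
import pandas as pd

cols = ['item1', 'item2', 'val1', 'val2', 'val3']
df1 = pd.DataFrame([[1, 2, 0.1, 0.2, 0.3],
                    [9, 7, 0.4, 0.4, 0.6]],   # hypothetical row, no match in df2
                   columns=cols)
df2 = pd.DataFrame([[1, 0.1, 0.5, 0.7]],
                   columns=['item1', 'val1', 'val2', 'val3'])

# A left join keeps every row of df1; indicator=True adds a `_merge` column
# that says whether a matching item1 was found in df2.
checked = df1.merge(df2, on='item1', how='left',
                    suffixes=('_df1', '_df2'), indicator=True)

unmatched = checked.loc[checked['_merge'] == 'left_only', ['item1', 'item2']]
print(unmatched)
```

With the data from the question every `item1` in `df1` also appears in `df2`, so the default inner join loses nothing there.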
    

    Step 2. Define a function to calculate the correlation.

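    def corr(df):
        # Pearson r between the Df1 values (_x) and the Df2 values (_y),
        # taken from the first row of the group.
        # .as_matrix() was removed in pandas 1.0; .to_numpy() replaces it.
        x = df[['val1_x', 'val2_x', 'val3_x']].to_numpy()[0]
        y = df[['val1_y', 'val2_y', 'val3_y']].to_numpy()[0]
        return pd.DataFrame(data=[pearsonr(x, y)[0]],
                            columns=['similarity'])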
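
To make it clearer what the correlation in this step measures, here is a minimal sketch that computes the Pearson coefficient by hand for the first merged row and cross-checks it against NumPy. The helper name `pearson_by_hand` is made up for this illustration; it is not part of pandas or SciPy.

```python
import numpy as np

def pearson_by_hand(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()   # centre both vectors on their means
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

x = [0.1, 0.2, 0.3]   # val1_x, val2_x, val3_x of the first merged row
y = [0.1, 0.5, 0.7]   # val1_y, val2_y, val3_y of the first merged row

r_manual = pearson_by_hand(x, y)
r_numpy = float(np.corrcoef(x, y)[0, 1])   # off-diagonal entry is r(x, y)
print(r_manual, r_numpy)
```

Both values should agree with the similarity reported for the pair (1, 2) in the final output. Note that with only three values per row the coefficient is very sensitive to small changes in the data.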
    

    Step 3. Group by the items and apply the corr function.

    df = df.groupby(['item1', 'item2']).apply(corr).reset_index().drop(columns=['level_2'])
    

    Output:

       item1  item2  similarity
    0      1      2    0.981981
    1      1      3    0.928571
    2      2      4    0.609994
    3      3      5    0.933257
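
Because `corr` only looks at row `[0]` of each group, the groupby gives one value per (item1, item2) pair and ignores any further rows if a pair occurs more than once. As a possible alternative, the sketch below computes the same coefficient for every merged row at once with NumPy, without `groupby` or SciPy. It assumes the same data as in Step 1; the names `merged`, `xc`, `yc` and `result` are chosen only for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2, 0.1, 0.2, 0.3],
                    [1, 3, 0.2, 0.3, 0.5],
                    [2, 4, 0.5, 0.5, 0.7],
                    [3, 5, 0.7, 0.2, 0.1]],
                   columns=['item1', 'item2', 'val1', 'val2', 'val3'])
df2 = pd.DataFrame([[1, 0.1, 0.5, 0.7],
                    [2, 0.2, 0.8, 0.9],
                    [3, 0.7, 0.6, 0.5]],
                   columns=['item1', 'val1', 'val2', 'val3'])

merged = df1.merge(df2, on='item1')

# One row per merged record, one column per val.
x = merged[['val1_x', 'val2_x', 'val3_x']].to_numpy()
y = merged[['val1_y', 'val2_y', 'val3_y']].to_numpy()

# Centre each row on its own mean, then apply the Pearson formula row by row.
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
similarity = (xc * yc).sum(axis=1) / np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))

result = merged[['item1', 'item2']].assign(similarity=similarity)
print(result)
```

A row whose three values are all identical has zero variance, so the division yields NaN for that row; such rows would need separate handling.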