Search code examples
pythonpandasdataframemergeconcatenation

Merge two dataframes groupby the column values of a dataframe


I have the following dataframe:

df1 = pd.DataFrame({'Parent': ['Stay home', "Stay home","Stay home", 'Go outside', "Go Outside","Go outside"],
                    'Child' : ['Severe weather', "Severe weather", "Severe weather", 'Sunny', "Sunny", "sunny"]})

    Parent      Child
0   Stay home   Severe weather
1   Stay home   Severe weather
2   Stay home   Severe weather
3   Go outside  Sunny
4   Go Outside  Sunny
5   Go outside  sunny

And a second one:

df2 = pd.DataFrame({'Similarity_Score': ['SimilarityScore:0.43693185876069784', 'SimilarityScore:0.299807821163373']})

             Similarity_Score
0   SimilarityScore:0.43693185876069784
1   SimilarityScore:0.299807821163373

I want to merge the two dataframes based on the values of Child values of df1.

Expected outcome:

     Parent     Child           Similarity_Score
0   Stay home   Severe weather  0.43693185876069784
1   Stay home   Severe weather  0.43693185876069784
2   Stay home   Severe weather  0.43693185876069784
3   Go outside  Sunny           0.299807821163373
4   Go Outside  Sunny           0.299807821163373
5   Go outside  sunny           0.299807821163373

I tried the usual merge and concat methods but could not find the solution. Any ideas?


Solution

  • If you want to assign the scores based on the Child's value, you can do it like that:

    import numpy as np
    import pandas as pd
    
    df1 = pd.DataFrame({'Parent': ['Stay home', "Stay home", "Stay home", 'Go outside', "Go Outside", "Go outside"],
                        'Child': ['Severe weather', "Severe weather", "Severe weather", 'Sunny', "Sunny", "Sunny"]})
    df2 = pd.DataFrame({'Similarity_Score': ['SimilarityScore:0.43693185876069784', 'SimilarityScore:0.299807821163373']})
    
    # Split the string at : and convert to float
    df2['Similarity_Score'] = df2['Similarity_Score'].str.split(':').str[1].astype(float)
    
    # calculate auxiliary column position to base the matching on
    df1['position'] = df1['Child'].apply(lambda row: np.where(df1['Child'].unique() == row)[0][0])
    
    # merge both dataframes and drop auxiliary column position
    df = df1.merge(df2, left_on='position', right_index=True).drop(columns=["position"])