Search code examples
pythonpandasdataframepandas-melt

join 2 dataframes to get output


Using df1 and df2 below, I would like to create df_new that has feat0, feat1, feat2 from df1 and their corresponding values from df2 as value_feat0, value_feat1, value_feat2. I have also created df_new as expected output.

df1 = pd.DataFrame([['Age', 'Education-Num', 'Sex'],
                    ['Sex','Education-Num','Age'],
                    ['Education-Num','Age','Sex'],
                    ['Age', 'Education-Num', 'Sex'],
                    ['Age', 'Education-Num', 'Sex']],
                   columns=['feat0', 'feat1', 'feat2'])
df2 = pd.DataFrame([[39.0, 7, 13.0, 4, 1, 0, 4, 1, 2174.0, 0.0, 40.0, 39],
                    [50.0, 6, 13.0, 2, 4, 4, 4, 1, 0.0, 0.0, 13.0, 39],
                    [38.0, 4, 9.0, 0, 6, 0, 4, 1, 0.0, 0.0, 40.0, 39],
                    [53.0, 4, 7.0, 2, 6, 4, 2, 1, 0.0, 0.0, 40.0, 39],
                    [28.0, 4, 13.0, 2, 10, 5, 2, 0, 0.0, 0.0, 40.0, 5]],
                   columns=['Age', 'Workclass', 'EducationNum', 'MaritalStatus', 'Occupation',
                            'Relationship', 'Race', 'Sex', 'CapitalGain', 'CapitalLoss',  
                            'Hoursperweek', 'Country'])

Expected output:

df_new = pd.DataFrame([['Age', 39, 'EducationNum', 13, 'Sex', 1],
                       ['Sex',1, 'EducationNum',13, 'Age', 50],
                       ['EducationNum',9, 'Age',38, 'Sex', 1],
                       ['Age', 38, 'EducationNum', 7, 'Sex', 1],
                       ['Age', 28, 'EducationNum', 13, 'Sex', 0]],
                      columns=['feat0', 'value_feat0', 'feat1', 'value_feat1', 'feat2', 'value_feat2'])

I tried using melt() and merge the dataframes, but could not succeed.


Solution

  • You can use a for loop on the columns of df1, and then use this proposed method to lookup for your values in df2:

    df_new = pd.DataFrame()
    
    for col in df1.columns:
        df_new[col] = df1[col]
        idx, cols = pd.factorize(df1[col])
        df_new['value_'+col] = df2.reindex(cols, axis=1).to_numpy()[np.arange(len(df2)), idx]
    

    Output:

        feat0           value_feat0     feat1           value_feat1     feat2   value_feat2
    0   Age             39.0            Education-Num   13.0            Sex     1.0
    1   Sex             1.0             Education-Num   13.0            Age     50.0
    2   Education-Num   9.0             Age             38.0            Sex     1.0
    3   Age             53.0            Education-Num   7.0             Sex     1.0
    4   Age             28.0            Education-Num   13.0            Sex     0.0
    

    Alternatively this could be done with the deprecated lookup method (not recommended):

    df_new = pd.DataFrame()
    
    for col in df1.columns:
        df_new[col] = df1[col]
        df_new['value_'+col] = df2.lookup(df1[col].index, df1[col].values)
    

    Note: I used the column name Education-Num in df2 to be consistent with df1.