Search code examples
pythonpandasmergeleft-join

Iterrows merge in Pandas results in duplicated columns with suffixes


I have a dataframe of event occurrences in which the date is formatted as 'YYYY-WW'. Various events can take place, some at the same time as other events in the same timeframe. Example datframe as follows;

df1 = pd.DataFrame({'yyyyww': ['2022-01','2022-02','2022-03', '2022-01','2022-02','2022-03','2022-01','2022-03'],
                         'event': ['event1','event1','event1','event2','event2','event3','event4','event4'],
                         'event_flag': [1,1,1,1,1,1,1,1,]})

enter image description here

I have a 2nd dataframe to which I want to left join the 1st dataframe. The 2nd dataframe could potentially contain many more dates than is featured in df1 but the for the purposes of this question is as follows:

df2 = pd.DataFrame({'yyyyww': ['2022-01','2022-02','2022-03'],
                         'col1': ['apple','car','banana']})

enter image description here

Ultimately, I want to perform a left join such that the values from event in df1 become additional column headers in df2, with the event_flag from df1 becoming a boolean value under the respective column header, as follows:

desired_outcome = pd.DataFrame({'yyyyww': ['2022-01','2022-02','2022-03'],
                         'col1': ['apple','car','banana'],
                         'event1':[1,1,1],
                         'event2':[1,1,0],
                         'event3':[0,0,1],
                         'event4':[1,0,1],
                         })

enter image description here

However, when using iterrows() to achieve this, what I end up with is something that bears some resemblance to the desired outcome but duplicates the columns such that I end up with multiple columns with suffixes, as follows:

for index, row in df1.iterrows():
    index_value = row['event']
    #column_a_value = row['disco']
    yyyyww = row['yyyyww']
    event_flag = row['event_flag']
    df2 = df2.merge(pd.DataFrame({'yyyyww': [yyyyww],
                                                        f'{index_value}': [event_flag]
                                                        }),
                                           left_on='yyyyww', right_on='yyyyww', how='left')
                                        

df2.fillna(0)

enter image description here

How can I perform the required operation without resulting in duplicated columns?


Solution

  • After .merge you can .pivot the dataframe to obtain the final form:

    out = (
        df2.merge(df1, on="yyyyww")
        .pivot(index=["yyyyww", "col1"], columns="event", values="event_flag")
        .fillna(0)
        .astype(int)
        .reset_index()
        .rename_axis(columns=None, index=None)
    )
    print(out)
    

    Prints:

        yyyyww    col1  event1  event2  event3  event4
    0  2022-01   apple       1       1       0       1
    1  2022-02     car       1       1       0       0
    2  2022-03  banana       1       0       1       1