Search code examples
pythonpandasmergeduplicatescartesian

How to avoid Cartesian while merging in Pandas Python


I am trying to merge 2 datasets X and Y. Dataset X has Joining Key column which has duplicate values. Dataset Y has the Joining key column and one additional column. The dataset images have been uploaded below. The problem is that I want to avoid Cartesian Product due to the duplicates present in dataset X. I have attached the Resultant dataset image below. This could be manually done by manually merging using a for loop, but it is time consuming. Anyone can provide a better method

All DataSets Image


Solution

  • Using @Alollz setup:

    df_x = pd.DataFrame({'EMM_ID': [610462, 610462, 610462, 610462, 61000, 61000],
                         'ID_home': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
    df_y = pd.DataFrame({'EMM_ID': [610462, 61000], 'ID_home': [81000, 18]})
    

    You could create a new 'key' to join on with cumcount.

    colkey = 'EMM_ID'
    df_x = df_x.assign(colkey=df_x.groupby(colkey).cumcount())
    df_y = df_y.assign(colkey=df_y.groupby(colkey).cumcount())
    
    df_x[['EMM_ID','colkey']].merge(df_y, on=['EMM_ID','colkey'], how='left')
    

    Output:

       EMM_ID  colkey  ID_home
    0  610462       0  81000.0
    1  610462       1      NaN
    2  610462       2      NaN
    3  610462       3      NaN
    4   61000       0     18.0
    5   61000       1      NaN