Search code examples
pythonpandasdataframegroup-by

Pandas: Merging 2 Dataframes while keeping the the same index of the main df


I am trying to merge 2 Dataframes: df and relations_df

df: (df's Index is post's ID)

Example df:

Post     similar_posts
-----     --------
Head-1    [123046, 1915, 111357]
Head-2    [654, 239]

Example relations_df:

id   related_id   keywords
---   ---------   --------
17    123046      ['key1', 'key2']
17    1915        ['key1', 'key2', 'key6']
17    111357      ['key3', 'key4']
18    654         ['key4', 'key6', 'key1']
18    239         ['key5', 'key2']

Expected Output: (df's index should persist)

Title     similar_posts           keywords
-----     --------                --------
Head-1    [123046, 1915, 111357]  [['key1', 'key2'], ['key1', 'key2', 'key6'], ['key3', 'key4']]
Head-2    [654, 239]              [['key4', 'key6', 'key1'], ['key5', 'key2']]

I am not very knowledgeable with pandas yet, so I tried to ask ChatGPT and failed to get an acceptable answer...

This is my code as of now:

def addkeywords(args):
    df, relations_df = args
    
    # Merge relations_df with df on 'id' and 'related_id'
    merged_df = pd.merge(df, relations_df, left_on='id', right_on='related_id', suffixes=('_left', '_right'), how='left')
    
    # Group by 'id' and aggregate 'keywords' into lists
    grouped_df = merged_df.groupby('id', as_index=False)['keywords'].agg(list)
    # grouped_df.rename(columns={'keywords_right': 'keywords'}, inplace=True)
    
    # Merge back to the original DataFrame
    df = pd.merge(df, grouped_df, on='id', how='left')
    
    # Fill NaN values in 'keywords' column with empty lists
    df['keywords'].fillna(value='', inplace=True)
    
    return df

The index of 'df' Dataframe should be the same (As the Index represents the post ID in df)

The index of 'relations_df' is autogenerated, so it can change.

Can anyone help me with this please?


Solution

  • # Group by 'id' and aggregate 'related_id' and 'keywords'
    df_grouped = relations_df.groupby('id').agg({'related_id': list, 'keywords': list}).reset_index()
    df_grouped.set_index('id', inplace=True)
      
    # Merge on 'id' of relations_df and 'index' of df
    merged_df = pd.merge(df, df_grouped, left_index=True, right_index=True)
    

    I managed to get the desired output using the aboove code.

    1. Compressed the flattened relations_df and set its index to 'id'
    2. Merged 'df' and 'relations_df' on their indexes