Search code examples
pythonpandaslistdataframeseries

Pandas map many to one instead of merge without dropping duplicates?


I have two dataframes like as below

data_df = pd.DataFrame({'person_id': ['abc@gmail.com','abc@gmail.com','abc@gmail.com','ace@gmail.com','ace@gmail.com','pqr@gmail.com','pqr@gmail.com'],
             'company': ['a','a','a','a','a','a','a'],
             'dept_access':['a1','a1','a1','a1','a2','a2','a2']})

key_df = pd.DataFrame({'p_id': ['abc@gmail.com','xyz@gmail.com','pqr@gmail.com'],
             'company': ['a','a','a'],
             'location':['UK','USA','KOREA']})

My objective is to do the below

a) Attach location column from key df to data df

So, I tried the merge option like below

data_df.merge(key_df,left_on='person_id',right_on='p_id',how='left')

But this results in more records than original data_df because of duplicates in merging column.

So, therefore, I would like to use map approach as I know my key_df will have one unique_key for each user. So, I was trying something like below

s = key_df.set_index(['p_id'])['location']
data_df['location'] = data_df[('person_id')].map(s)

but this doesn't work as well.

I expect my output to have 4 columns as below

enter image description here


Solution

  • In your data second solution working well, in real data is possible remove duplicates first:

    s = key_df.drop_duplicates('p_id').set_index(['p_id'])['location']
    data_df['location'] = data_df[('person_id')].map(s)
    print (data_df)
    
           person_id company dept_access location
    0  abc@gmail.com       a          a1       UK
    1  abc@gmail.com       a          a1       UK
    2  abc@gmail.com       a          a1       UK
    3  ace@gmail.com       a          a1      NaN
    4  ace@gmail.com       a          a2      NaN
    5  pqr@gmail.com       a          a2    KOREA
    6  pqr@gmail.com       a          a2    KOREA