
Pandas: concat with duplicated index


I am trying to concatenate four DataFrames. df has a unique index, while the other three have duplicated values in their index. Here's my code:

import pandas as pd

data = {'id':['1','2','3','4','5','6'],
        'name':['A1','A2','A2','A1','A3','A3'],
        'date_create':['28-01-2020','15-02-2020','15-03-2020',
                       '25-03-2020','01-04-2020','30-05-2020']}
df = pd.DataFrame(data)

data = {'userid':['1','2','3','3','5','6'],
        'likesDate':['28-01-2020','15-02-2020','15-03-2020',
                       '25-03-2020','01-04-2020','30-05-2020']}
df1 = pd.DataFrame(data)

data = {'userid':['1','1','2','4','6','6'],
        'dislikesDate':['8-01-2020','1-02-2020','15-03-2020',
                       '25-03-2020','01-04-2020','30-05-2020']}
df2 = pd.DataFrame(data)

data = {'userid':['1','2','2','4','4','6'],
        'DeleteDate':['28-01-2020','15-02-2020','15-03-2020',
                       '25-03-2020','05-04-2020','23-05-2020']}
df3 = pd.DataFrame(data)

df.set_index('id', inplace=True)
df1.set_index('userid', inplace=True)
df2.set_index('userid', inplace=True)
df3.set_index('userid', inplace=True)

df_final = pd.concat([df,df1, df2,df3],axis=1)

print(df_final)

I am getting this error:

ValueError: Shape of passed values is (7, 5), indices imply (6, 5)

What am I doing wrong, and where?


Solution

  • IIUC, because the indexes contain duplicated values, you can chain join calls instead of using concat. With only a few DataFrames, you can write the chain out in full:

    df_final = df.join(df1).join(df2).join(df3)
    print(df_final.head())
    #   name date_create   likesDate dislikesDate  DeleteDate
    # 1   A1  28-01-2020  28-01-2020    8-01-2020  28-01-2020
    # 1   A1  28-01-2020  28-01-2020    1-02-2020  28-01-2020
    # 2   A2  15-02-2020  15-02-2020   15-03-2020  15-02-2020
    # 2   A2  15-02-2020  15-02-2020   15-03-2020  15-03-2020
    # 3   A2  15-03-2020  15-03-2020          NaN         NaN
    

    If you want a more flexible solution that accepts any number of DataFrames, use reduce:

    from functools import reduce
    
    df_final = (
        reduce(lambda x, y: x.join(y), [df, df1, df2, df3])  # same as chaining joins
          .rename_axis('user_id').reset_index() # user id back as a column if needed
    )
    print(df_final)
    #    user_id name date_create   likesDate dislikesDate  DeleteDate
    # 0        1   A1  28-01-2020  28-01-2020    8-01-2020  28-01-2020
    # 1        1   A1  28-01-2020  28-01-2020    1-02-2020  28-01-2020
    # 2        2   A2  15-02-2020  15-02-2020   15-03-2020  15-02-2020
    # 3        2   A2  15-02-2020  15-02-2020   15-03-2020  15-03-2020
    # 4        3   A2  15-03-2020  15-03-2020          NaN         NaN
    # 5        3   A2  15-03-2020  25-03-2020          NaN         NaN
    # 6        4   A1  25-03-2020         NaN   25-03-2020  25-03-2020
    # 7        4   A1  25-03-2020         NaN   25-03-2020  05-04-2020
    # 8        5   A3  01-04-2020  01-04-2020          NaN         NaN
    # 9        6   A3  30-05-2020  30-05-2020   01-04-2020  23-05-2020
    # 10       6   A3  30-05-2020  30-05-2020   30-05-2020  23-05-2020
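To see which frames cause the trouble, and to sanity-check the joined result, something like the following sketch may help. It rebuilds the question's data, checks each index for uniqueness, and confirms the row count after the chained join. The names frames and unique_flags are only illustrative and do not appear in the original post.

```python
from functools import reduce

import pandas as pd

# Same data as in the question, with the key column set as the index.
df = pd.DataFrame({
    'id': ['1', '2', '3', '4', '5', '6'],
    'name': ['A1', 'A2', 'A2', 'A1', 'A3', 'A3'],
    'date_create': ['28-01-2020', '15-02-2020', '15-03-2020',
                    '25-03-2020', '01-04-2020', '30-05-2020'],
}).set_index('id')

df1 = pd.DataFrame({
    'userid': ['1', '2', '3', '3', '5', '6'],
    'likesDate': ['28-01-2020', '15-02-2020', '15-03-2020',
                  '25-03-2020', '01-04-2020', '30-05-2020'],
}).set_index('userid')

df2 = pd.DataFrame({
    'userid': ['1', '1', '2', '4', '6', '6'],
    'dislikesDate': ['8-01-2020', '1-02-2020', '15-03-2020',
                     '25-03-2020', '01-04-2020', '30-05-2020'],
}).set_index('userid')

df3 = pd.DataFrame({
    'userid': ['1', '2', '2', '4', '4', '6'],
    'DeleteDate': ['28-01-2020', '15-02-2020', '15-03-2020',
                   '25-03-2020', '05-04-2020', '23-05-2020'],
}).set_index('userid')

# Illustrative helper mapping (not from the original post).
frames = {'df': df, 'df1': df1, 'df2': df2, 'df3': df3}

# Only df has a unique index; the other three contain repeated keys.
unique_flags = {name: bool(frame.index.is_unique) for name, frame in frames.items()}
print(unique_flags)
# {'df': True, 'df1': False, 'df2': False, 'df3': False}

# Chained left joins on the index, equivalent to df.join(df1).join(df2).join(df3).
df_final = reduce(lambda left, right: left.join(right), frames.values())
print(len(df_final))
# 11
```

Each join is a left join on the index, so a key that is repeated on the right-hand side repeats the matching left-hand rows. That is why six rows in df grow to eleven in the result.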