Search code examples
pandasconcatenation

Merging/Concat non unique multi index with Date


I have 2 data frames as below:

df1 =

City       Date           Data1
LA         2020-01-01     20
LA         2020-01-02     30
NY         2020-01-01     50

df2 = 

City       Date           Data2
LA         2020-01-01     2.5
LA         2020-01-02     1
LA         2020-01-03     7
NY         2020-01-01     6.5

I want to merge or concat both of them based on 'City' and 'Date', such that the result will be:

City       Date           Data1    Data2
LA         2020-01-01     20       2.5
LA         2020-01-02     30       1
NY         2020-01-01     50       6.5

What I tried:

pd.concat([df1.set_index(['Country','Date'],[df1.set_index(['Country','Date'])], axis = 1)

And I get error: ValueError: cannot handle a non-unique multi-index!

I cant do merge either since I have Date as index.


Solution

  • Idea is deduplicated pairs by new column created by GroupBy.cumcount:

    print (df2)
      City        Date  Data2
    0   LA  2020-01-01    2.5
    1   LA  2020-01-02    1.0 <- duplicates
    2   LA  2020-01-02    7.0 <- duplicates
    3   NY  2020-01-01    6.5
    
    df1 = (df1.assign(g = df1.groupby(['City','Date']).cumcount())
              .set_index(['City','Date','g']))
    df2 = (df2.assign(g = df2.groupby(['City','Date']).cumcount())
              .set_index(['City','Date','g']))
    
    df = pd.concat([df1, df2], axis = 1)
    print (df)
                       Data1  Data2
    City Date       g              
    LA   2020-01-01 0   20.0    2.5
         2020-01-02 0   30.0    1.0
                    1    NaN    7.0
    NY   2020-01-01 0   50.0    6.5
    

    If need remove helper level g:

    df = pd.concat([df1, df2], axis = 1).reset_index(level=2, drop=True)
    print (df)
                     Data1  Data2
    City Date                    
    LA   2020-01-01   20.0    2.5
         2020-01-02   30.0    1.0
         2020-01-02    NaN    7.0
    NY   2020-01-01   50.0    6.5
    

    EDIT: I think here is necessary convert both columns to DataFrame and then use inner join with DataFrame.merge:

    df1['Date'] = pd.to_datetime(df1['Date'])
    df2['Date'] = pd.to_datetime(df2['Date'])
    
    df = df1.merge(df2, on=['City','Date'])
    print (df)
      City       Date  Data1  Data2
    0   LA 2020-01-01     20    2.5
    1   LA 2020-01-02     30    1.0
    2   NY 2020-01-01     50    6.5