Search code examples
pythonpandasdataframemergeconcatenation

Combining two Dataframes with same column names, but different MultiIndex


I have two different DataFrames that I would like to combine and I cannot get the answer I am looking for. A MWE is given below.

import numpy as np
import pandas as pd

index_tuple1 = [('car', 'green'), ('car', 'red'), ('boat', 'green')]
index1 = pd.MultiIndex.from_tuples(index_tuple1, names=['Vehicle', 'Color'])
data1 = np.array([[1,2], [3,4], [7,8]])
df1 = pd.DataFrame(data1, index=index1, columns=['2022-05-09', '2022-10-17'])

index_tuple2 = [('car', 'blue'), ('boat', 'blue')]
index2 = pd.MultiIndex.from_tuples(index_tuple2, names=['Vehicle', 'Color'])
data2 = np.array([[5,6], [9,0]])
df2 = pd.DataFrame(data2, index=index2, columns=['2022-05-09', '2022-10-17'])

print(df1)
print(df2)


Output:
               2022-05-09  2022-10-17
Vehicle Color
car     green           1           2
        red             3           4
boat    green           7           8

               2022-05-09  2022-10-17
Vehicle Color
car     blue            5           6
boat    blue            9           0

What I would like to see as output is given below. I have tried concat, merge, and join, but I either get a DataFrame filled with NaNs or "ValueError: columns overlap but no suffix specified: Index(['2022-05-09', '2022-10-17']

               2022-05-09  2022-10-17
Vehicle Color
car     green           1           2
        red             3           4
        blue            5           6
boat    green           7           8
        blue            9           0 

Solution

  • You need same types of columns, e.g. strings or datetimes, then concat working correct:

    df1.columns = pd.to_datetime(df1.columns)
    df2.columns = pd.to_datetime(df2.columns)
    
    df = pd.concat([df1, df2]).sort_index()
    print (df)
                   2022-05-09  2022-10-17
    Vehicle Color                        
    boat    blue            9           0
            green           7           8
    car     blue            5           6
            green           1           2
            red             3           4
    

    If ordering is important use here is solution with ordered CategoricalIndex:

    df1.columns = pd.to_datetime(df1.columns)
    df2.columns = pd.to_datetime(df2.columns)
    
    lvl = df1.index.get_level_values(0).append(df1.index.get_level_values(0)).unique()
    print (lvl)
    Index(['car', 'boat'], dtype='object', name='Vehicle')
    
    idx1 = pd.CategoricalIndex(df1.index.levels[0], ordered=True, categories=lvl)
    idx2 = pd.CategoricalIndex(df2.index.levels[0], ordered=True, categories=lvl)
    df1.index = df1.index.set_levels(idx1, level = 0)
    df2.index = df2.index.set_levels(idx2, level = 0)
    
    
    df = pd.concat([df1, df2]).sort_index(level=0, sort_remaining=False)
    print (df)
                   2022-05-09  2022-10-17
    Vehicle Color                        
    car     green           1           2
            red             3           4
            blue            5           6
    boat    green           7           8
            blue            9           0