Search code examples
pythonpandasdataframemulti-index

Efficiently combine dataframes on 2nd level index


I have two dataframes looking like

import pandas as pd

df1 = pd.DataFrame([2.1,4.2,6.3,8.4,10.5], index=[2,4,6,8,10])
df1.index.name = 't'

df2 = pd.DataFrame(index=pd.MultiIndex.from_tuples([('A','a',1),('A','a',4),
                           ('A','b',5),('A','b',6),('B','c',7),
                           ('B','c',9),('B','d',10),('B','d',11),
                          ], names=('big', 'small', 't')))

I am searching for an efficient way to combine them such that I get

                 0
big small t       
A   a     1    NaN
          2    2.1
          4    4.2
    b     5    NaN
          6    6.3
B   c     7    NaN
          8    8.4
          9    NaN
    d     10  10.5
          11   NaN

I.e. I want to get the index levels 0 and 1 of df2 as index levels 0 and 1 in df1.

Of course a loop over the dataframe would work as well, though not feasible for large dataframes.

EDIT: It appears from comments below that I should add, the indices big and small should be inferred on t in df1 based on the ordering of t.


Solution

  • Assuming that you want the unknown index levels to be inferred based on the ordering of 't', we can use an other merge, sort the values and then re-create the MultiIndex using ffill logic (need a Series for this).

    res = (df2.reset_index()
              .merge(df1, on='t', how='outer')
              .set_index(df2.index.names)
              .sort_index(level='t'))
    
    res.index = pd.MultiIndex.from_arrays(
                    [pd.Series(res.index.get_level_values(i)).ffill()
                     for i in range(res.index.nlevels)],
                    names=res.index.names)
    

    print(res)
    
                     0
    big small t       
    A   a     1    NaN
              2    2.1
              4    4.2
        b     5    NaN
              6    6.3
    B   c     7    NaN
              8    8.4
              9    NaN
        d     10  10.5
              11   NaN