Search code examples
pythonpandasdataframemulti-indexfillna

Using fillna with two multi-index dataframes throws InvalidIndexError


I have two dataframes like this:

import pandas as pd
import numpy as np


df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1675987'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])

df2 = pd.DataFrame({
    'key1': list('ABCCADD'),
    'key2': list('1598787'),
    'prop1': [np.nan] * 7,
    'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])

          prop1 prop2
key1 key2            
A    1        x     m
B    6        y     n
A    7        z     b
     5        u     b
C    9        y     b
     8        n     a
A    7        b     s

           prop1  prop2
key1 key2              
A    1       NaN    NaN
B    5       NaN    NaN
C    9       NaN    NaN
     8       NaN    NaN
A    7       NaN    NaN
D    8       NaN    NaN
     7       NaN    NaN

and would now like to use df1 to fill df2 using

df2.fillna(df1)

however,I get

site-packages/pandas/core/generic.py in _where(self, cond, other, inplace, axis, level, errors, try_cast) 8694
other._get_axis(i).equals(ax) for i, ax in enumerate(self.axes)
8695 ): -> 8696 raise InvalidIndexError 8697 8698 # slice me out of the other

InvalidIndexError:

I used this approach successfully in the past and I do not really understand why that one fails. Any ideas how to make it work?

EDIT

Here is an example which is very similar and works perfectly fine:

filler1 = pd.DataFrame({
    'key': list('AAABCCDD'),
    'prop1': list('xyzuyasj'),
    'prop2': list('mnbbbqwo')
})

tobefilled1 = pd.DataFrame({
    'key': list('AAABBCACDF'),
    'keep_me': ['stuff'] * 10,
    'prop1': [np.nan] * 10,
    'prop2': [np.nan] * 10,
    
})

filler1['g'] = filler1.groupby('key').cumcount()
tobefilled1['g'] = tobefilled1.groupby('key').cumcount()

filler1 = filler1.set_index(['key', 'g'])
tobefilled1 = tobefilled1.set_index(['key', 'g'])

print(tobefilled1.fillna(filler1))

prints

key g                    
A   0   stuff     x     m
    1   stuff     y     n
    2   stuff     z     b
B   0   stuff     u     b
    1   stuff   NaN   NaN
C   0   stuff     y     b
A   3   stuff   NaN   NaN
C   1   stuff     a     q
D   0   stuff     s     w
F   0   stuff   NaN   NaN

Solution

  • The problem here is the duplicate index defined in df1:

    df1 = pd.DataFrame({
        'key1': list('ABAACCA'),
        'key2': list('1675987'),
        'prop1': list('xyzuynb'),
        'prop2': list('mnbbbas')
    }).set_index(['key1', 'key2'])
    

    Note: Key1=A Key2=7 appears twice, the index for df1 is not unique.

    Let's change that second A7 to A9

    df1 = pd.DataFrame({
        'key1': list('ABAACCA'),
        'key2': list('1675989'),
        'prop1': list('xyzuynb'),
        'prop2': list('mnbbbas')
    }).set_index(['key1', 'key2'])
    
    df2 = pd.DataFrame({
        'key1': list('ABCCADD'),
        'key2': list('1598787'),
        'prop1': [np.nan] * 7,
        'prop2': [np.nan] * 7
    }).set_index(['key1', 'key2'])
    

    Thus creating unique indexing in df1, now try df.fillna:

    df2.fillna(df1)
    

    Output:

              prop1 prop2
    key1 key2            
    A    1        x     m
    B    5      NaN   NaN
    C    9        y     b
         8        n     a
    A    7        z     b
    D    8      NaN   NaN
         7      NaN   NaN
    

    I got hint of this when I tried the reindex_like method, first with unique indexing:

    df1 = pd.DataFrame({
        'key1': list('ABAACCA'),
        'key2': list('1675989'),
        'prop1': list('xyzuynb'),
        'prop2': list('mnbbbas')
    }).set_index(['key1', 'key2'])
    
    df2 = pd.DataFrame({
        'key1': list('ABCCADD'),
        'key2': list('1598787'),
        'prop1': [np.nan] * 7,
        'prop2': [np.nan] * 7
    }).set_index(['key1', 'key2'])
    print(df1.reindex_like(df2))
    

    Output:

              prop1 prop2
    key1 key2            
    A    1        x     m
    B    5      NaN   NaN
    C    9        y     b
         8        n     a
    A    7        z     b
    D    8      NaN   NaN
         7      NaN   NaN
    

    Now, let's revert to the original dataframes in the post:

    df1 = pd.DataFrame({
        'key1': list('ABAACCA'),
        'key2': list('1675987'),
        'prop1': list('xyzuynb'),
        'prop2': list('mnbbbas')
    }).set_index(['key1', 'key2'])
    
    df2 = pd.DataFrame({
        'key1': list('ABCCADD'),
        'key2': list('1598787'),
        'prop1': [np.nan] * 7,
        'prop2': [np.nan] * 7
    }).set_index(['key1', 'key2'])
    print(df1.reindex_like(df2))
    

    Output ValueError:

    ValueError: cannot handle a non-unique multi-index!
    

    Another work-around it to create unique indexing by adding another index level with cumcount.

    df1 = pd.DataFrame({
        'key1': list('ABAACCA'),
        'key2': list('1675987'),
        'prop1': list('xyzuynb'),
        'prop2': list('mnbbbas')
    }).set_index(['key1', 'key2'])
    
    df2 = pd.DataFrame({
        'key1': list('ABCCADD'),
        'key2': list('1598787'),
        'prop1': [np.nan] * 7,
        'prop2': [np.nan] * 7
    }).set_index(['key1', 'key2'])
    
    df1 = df1.set_index(df1.groupby(df1.index).cumcount(), append=True)
    df2 = df2.set_index(df2.groupby(df2.index).cumcount(), append=True)
    
    df2.fillna(df1)
    

    Output:

                prop1 prop2
    key1 key2              
    A    1    0     x     m
    B    5    0   NaN   NaN
    C    9    0     y     b
         8    0     n     a
    A    7    0     z     b
    D    8    0   NaN   NaN
         7    0   NaN   NaN
    

    Then you can drop index level 2:

    df2.fillna(df1).reset_index(level=2, drop=True)
    

    Output:

              prop1 prop2
    key1 key2            
    A    1        x     m
    B    5      NaN   NaN
    C    9        y     b
         8        n     a
    A    7        z     b
    D    8      NaN   NaN
         7      NaN   NaN
    

    However, I think pandas should have nicer error messaging for fillna non-unique MultiIndexes like it does for reindex_like.