Search code examples
pythonpandasmulti-index

Update MultiIndex-DataFrame with a DataFrame which MultiIndex is a subset of the first


I have two panda DataFrames x and y each with a MultiIndex. The MultiIndex of y is a subset of x. I would like to update fields in x using values of y:

x.index.names
Out[]: FrozenList(['cohort', 'id', 'design', 'date'])

y.index.names
Out[]: FrozenList(['cohort', 'id'])

Ho can I do that?


Example:

import pandas as pd

The DataFrame x:

# sets of different measurements on different subjects on different
# dates.

df = pd.read_pickle('protocol.pkl')

df.set_index(
        keys=['cohort', 'id', 'design', 'date'],
        inplace=True,
        verify_integrity=True,
        drop=True)

df.head()
Out[]:
                             valid  epi
cohort id design date
FOOBAR 1  FOO    2014-04-22   True    3
       2  BAR    2014-04-24   True    3
       2  BAR    2014-04-25   True    3
       4  FOO    2014-04-25   True    3
       4  BAR    2014-05-05   True    3

df.shape
Out[]: (714, 2)

The DataFrame y:

# subjects to exclude from the study

up = pd.read_pickle('outlying.pkl')

up.set_index(keys=['cohort', 'id', 'design'],
        inplace=True,
        verify_integrity=True,
        drop=True)

up.head()
Out[]:
                     valid
cohort id  design
FOOBAR 1   BAR       False
       2   BAR       False
       12  BAR       False
       22  FOO       False
       28  FOO       False

up.head()
Out[]: (14, 1)

The result of the update should be:

df.head()
Out[]:
                             valid  epi
cohort id design date
FOOBAR 1  FOO    2014-04-22   True    3
       2  BAR    2014-04-24   False   3
       2  BAR    2014-04-25   False   3
       4  FOO    2014-04-25   True    3
       4  BAR    2014-05-05   True    3

I was hoping

df.update(up)

does the trick, as the index of up is a "subset" of df, but it has no effect on df.


Solution

  • I try reindex with Multiindex, but get:

    TypeError: Join on level between two MultiIndex objects is ambiguous

    So possible solution is join with assign and fillna for replace NaNs:

    df = (df.reset_index().join(up, on=['cohort','id','design'], lsuffix='_')
            .assign(valid=lambda x: x.valid.fillna(x.valid_))
            .drop('valid_', axis=1)
            .set_index(['cohort','id','design', 'date'])
           )
    print (df)
                                 epi  valid
    cohort id design date                  
    FOOBAR 1  FOO    2014-04-22    3   True
           2  BAR    2014-04-24    3  False
                     2014-04-25    3  False
           4  FOO    2014-04-25    3   True
              BAR    2014-05-05    3   True