I have two panda DataFrames x
and y
each with a MultiIndex. The MultiIndex of y
is a subset of x
. I would like to update fields in x
using values of y
:
x.index.names
Out[]: FrozenList(['cohort', 'id', 'design', 'date'])
y.index.names
Out[]: FrozenList(['cohort', 'id'])
Ho can I do that?
Example:
import pandas as pd
The DataFrame x
:
# sets of different measurements on different subjects on different
# dates.
df = pd.read_pickle('protocol.pkl')
df.set_index(
keys=['cohort', 'id', 'design', 'date'],
inplace=True,
verify_integrity=True,
drop=True)
df.head()
Out[]:
valid epi
cohort id design date
FOOBAR 1 FOO 2014-04-22 True 3
2 BAR 2014-04-24 True 3
2 BAR 2014-04-25 True 3
4 FOO 2014-04-25 True 3
4 BAR 2014-05-05 True 3
df.shape
Out[]: (714, 2)
The DataFrame y
:
# subjects to exclude from the study
up = pd.read_pickle('outlying.pkl')
up.set_index(keys=['cohort', 'id', 'design'],
inplace=True,
verify_integrity=True,
drop=True)
up.head()
Out[]:
valid
cohort id design
FOOBAR 1 BAR False
2 BAR False
12 BAR False
22 FOO False
28 FOO False
up.head()
Out[]: (14, 1)
The result of the update should be:
df.head()
Out[]:
valid epi
cohort id design date
FOOBAR 1 FOO 2014-04-22 True 3
2 BAR 2014-04-24 False 3
2 BAR 2014-04-25 False 3
4 FOO 2014-04-25 True 3
4 BAR 2014-05-05 True 3
I was hoping
df.update(up)
does the trick, as the index of up is a "subset" of df, but it has no effect on df.
I try reindex with Multiindex
, but get:
TypeError: Join on level between two MultiIndex objects is ambiguous
So possible solution is join
with assign
and fillna
for replace NaN
s:
df = (df.reset_index().join(up, on=['cohort','id','design'], lsuffix='_')
.assign(valid=lambda x: x.valid.fillna(x.valid_))
.drop('valid_', axis=1)
.set_index(['cohort','id','design', 'date'])
)
print (df)
epi valid
cohort id design date
FOOBAR 1 FOO 2014-04-22 3 True
2 BAR 2014-04-24 3 False
2014-04-25 3 False
4 FOO 2014-04-25 3 True
BAR 2014-05-05 3 True