pandas: assign to multiindex using .loc with mask


I'm using the example from the "Using slicers" section of the MultiIndex / advanced indexing documentation:

import numpy as np
import pandas as pd


def mklbl(prefix, n):
    # Build labels such as A0, A1, ... for one index level
    return ["%s%s" % (prefix, i) for i in range(n)]


miindex = pd.MultiIndex.from_product(
    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
)


micolumns = pd.MultiIndex.from_tuples(
    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")],
    names=["lvl0", "lvl1"]
)


dfmi = (
    pd.DataFrame(
        np.arange(len(miindex) * len(micolumns)).reshape(
            (len(miindex), len(micolumns))
        ),
        index=miindex,
        columns=micolumns,
    )
    .sort_index()
    .sort_index(axis=1)
)
>>> dfmi
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]

In pseudo-code, what I want:

if D1/bar % 3 == 0 && D1/foo > 100:
    D0/bar = np.nan

Almost, but not quite there:

mask = ( (dfmi.loc[pd.IndexSlice[:,:,:,"D1"], ("a","bar")] % 3 == 0)
       & (dfmi.loc[pd.IndexSlice[:,:,:,"D1"], ("a","foo")] > 100))

dfmi.loc[pd.IndexSlice[:,:,:,"D0",mask], ("a","bar")] = np.nan

The issue is that at any given index level either a mask or a selector can apply, but not both. For example, I can apply the mask at a different level. That requires the mask to be generated with a full index (no missing values) or re-aligned to the original index. How can I do that? Other approaches are welcome too.


Later...

I really thought this would work as the innermost index should have half the rows, but for some reason it raises a ValueError. Anyone know why?

>>> dfmi.swaplevel(0,3).loc[pd.IndexSlice["D0",:,:,mask.values], ("a","bar")] = np.nan
...
ValueError: cannot index with a boolean indexer that is not the same length as the index

While this does work, I thought there would be a cleaner way to change index values. I thought I'd used index.set_levels successfully in the past. Anyone care to fix this up?

t = mask.reset_index()
t["level_3"] = "D0"
t = t.set_index(list(t.columns.values[:4]))
mask = t.reindex(dfmi.index).fillna(False)
dfmi.loc[mask[0], ("a","bar")] = np.nan

Solution

  • You could create a temporary Series d0 that holds the D0 rows of ('a', 'bar'):

    d0 = dfmi.loc[pd.IndexSlice[:,:,:,"D0"], ('a','bar')]
    

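    Next, pass the boolean values of mask to the Series.mask method, which replaces the flagged entries with NaN. Using mask.array drops the D1 index, so the values are matched to d0 by position: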

    d0 = d0.mask(mask.array)
    

    Update the original dataframe with d0:

    dfmi.loc[d0.index, ('a', 'bar')] = d0
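The steps above can be put together into one runnable sketch. It rebuilds the frame from the question and applies the same three steps. Two details are assumptions rather than part of the answer: the frame is copied to float64 first (the name out is made up for this sketch) so that NaN can be stored without an integer-to-float upcast on assignment, and mask.to_numpy() is used in place of mask.array, which should behave the same way here. The positional pairing of D0 and D1 rows relies on the frame being sorted and on every A/B/C combination having both a D0 and a D1 row.

```python
import numpy as np
import pandas as pd


def mklbl(prefix, n):
    return [f"{prefix}{i}" for i in range(n)]


miindex = pd.MultiIndex.from_product(
    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
)
micolumns = pd.MultiIndex.from_tuples(
    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")],
    names=["lvl0", "lvl1"],
)
dfmi = (
    pd.DataFrame(
        np.arange(len(miindex) * len(micolumns)).reshape(
            (len(miindex), len(micolumns))
        ),
        index=miindex,
        columns=micolumns,
    )
    .sort_index()
    .sort_index(axis=1)
)

idx = pd.IndexSlice

# Assumption: work on a float copy so NaN fits without changing dtype on assignment
out = dfmi.astype("float64")

# Condition evaluated on the D1 rows only (32 of the 64 rows)
mask = (out.loc[idx[:, :, :, "D1"], ("a", "bar")] % 3 == 0) & (
    out.loc[idx[:, :, :, "D1"], ("a", "foo")] > 100
)

# D0 rows of ("a", "bar"), in the same A/B/C order as the D1 rows
d0 = out.loc[idx[:, :, :, "D0"], ("a", "bar")]

# Positional masking: the i-th D0 row is blanked when the i-th D1 row matches
d0 = d0.mask(mask.to_numpy())

# Write back; this assignment aligns on d0's own index
out.loc[d0.index, ("a", "bar")] = d0

print(out.loc[out[("a", "bar")].isna(), ("a", "bar")])
```

Because the mask is applied by position, it may be worth confirming that the D0 and D1 slices have the same length before relying on this pattern with data that is less regular.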