Search code examples
pythonpandasmulti-index

How do I filter multi-level columns using notnull() in pandas?


I generate a multi-index dataframe that has some NAN values using this:

arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],["one", "two", "one", "two", "one", "two", "one", "two"],]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
a = np.random.randn(3, 8)
mask = np.random.choice([1, 0], a.shape, p=[.3, .7]).astype(bool)
a[mask] = np.nan
df = pd.DataFrame(a, columns=index)
df

Which will create something like this:

enter image description here

I'd like to get rows of a specific subset of top-level columns (eg df[['baz','qux']]) that have no nulls. For example in df[['baz','qux']] I'd like to get rows 0 and 1 since they both have all nulls in 3.

Hoping things would just work like a normal df I tried:

cols = ['bar','baz']
df[cols].loc[df[cols].notnull()]

But I obviously am missing something:

ValueError: Cannot index with multidimensional key

The pandas documentation for multiindex/advanced indexing illustrates how to index and slic this sort of dataframe but doesn't seem to have anything regarding .loc/lookups/filtering. So I assume I'm looking in the wrong place. But I am having trouble finding results or resources on this.


Solution

  • df[cols].notna() is not a 1D boolean mask. You have to reduce the dimension using all or any on axis.

    >>> df[df[cols].notna().all(1)]
    
            bar                 baz                 foo                 qux
            one       two       one       two       one       two       one       two
    0  1.799680 -0.901705 -1.575930  0.185863 -0.793007  1.485423       NaN       NaN
    2  1.379878 -0.748599  0.661697 -1.015311 -0.858144       NaN -1.623013  0.340043