Search code examples
pandasmulti-index

How i can find non Nan occurrencies patterns in a multi index dataframe?


I am dealing with a multi-indexed dataframe that looks like this:

multi indexed data frame

(sorry for writing null instead of NaN)

What could be the most efficient way to find occurrences of the patterns i highlighted?

I expect to reach a result like this one:

pattern occurrences i am looking for

Thanks in advance for any insight!

For who wants to play with it:

from io import StringIO
import pandas as pd


df1_text = """       A  B C
STAND1 CH1 NaN NaN NaN
STAND1 CH2 NaN 11.2 NaN
STAND1 CH3 12.4 7.0 NaN
STAND1 CH4 10.2 2.0 NaN
STAND2 CH1 NaN 2.5 NaN
STAND2 CH2 NaN 11.2 NaN
STAND2 CH3 NaN NaN 6.3
STAND2 CH4 NaN NaN 23.5
STAND3 CH1 NaN NaN NaN
STAND3 CH2 12.3 NaN NaN
STAND3 CH3 5.3 4.5 NaN
STAND3 CH4 7.2 25.6 NaN"""

df1 = pd.read_csv(StringIO(df1_text), delim_whitespace=True)

Solution

  • Here is one approach. In short, you can use

    df2 = df.swaplevel(0,1).unstack().notnull()
    print(pd.Series(np.dot(df2.index, df2)).value_counts())
    

    The first line creates df2 that lines up the channel column with 9 columns of boolean indicators of cells that are not null, e.g.

             # A                    B                    C
        # STAND1 STAND2 STAND3 STAND1 STAND2 STAND3 STAND1 STAND2 STAND3
    # CH1  False  False  False  False   True  False  False  False  False
    # CH2  False  False   True   True   True  False  False  False  False
    # CH3   True  False   True   True  False   True  False   True  False
    # CH4   True  False   True   True  False   True  False   True  False
    

    The goal of the second step is to replace each column in df2 with a string representing an event. Using the fact that Python strings can be multiplied by integers, we get

    np.dot([CH1, CH2, CH3, CH4], [True, True, False, False])      <==>
    'CH1' * True + 'CH2' * True + 'CH3' * False + 'CH4' * False   <==>
    'CH1' * 1 + 'CH2' * 1 + 'CH3' * 0 + 'CH4' * 0                 <==>
    'CH1' + 'CH2'                                                 <==>
    'CH1CH2'
    

    This has a cosmetic defect of omitting commas and including an "empty" event.

    Full example:

    from io import StringIO
    import pandas as pd
    
    
    df1_text = """       A  B C
    STAND1 CH1 NaN NaN NaN
    STAND1 CH2 NaN 11.2 NaN
    STAND1 CH3 12.4 7.0 NaN
    STAND1 CH4 10.2 2.0 NaN
    STAND2 CH1 NaN 2.5 NaN
    STAND2 CH2 NaN 11.2 NaN
    STAND2 CH3 NaN NaN 6.3
    STAND2 CH4 NaN NaN 23.5
    STAND3 CH1 NaN NaN NaN
    STAND3 CH2 12.3 NaN NaN
    STAND3 CH3 5.3 4.5 NaN
    STAND3 CH4 7.2 25.6 NaN"""
    
    df1 = pd.read_csv(StringIO(df1_text), delim_whitespace=True)
    
    # solution
    df2 = df.swaplevel(0,1).unstack().notnull()
    print(pd.Series(np.dot(df2.index, df2)).value_counts())
    
    # In [559]: df.swaplevel(0,1).unstack().notnull()
    # Out[559]:
             # A                    B                    C
        # STAND1 STAND2 STAND3 STAND1 STAND2 STAND3 STAND1 STAND2 STAND3
    # CH1  False  False  False  False   True  False  False  False  False
    # CH2  False  False   True   True   True  False  False  False  False
    # CH3   True  False   True   True  False   True  False   True  False
    # CH4   True  False   True   True  False   True  False   True  False
    
    # In [560]: np.dot(df2.index, df2)
    # Out[560]: 
    # array(['CH3CH4', '', 'CH2CH3CH4', 'CH2CH3CH4', 'CH1CH2', 'CH3CH4', '',
           # 'CH3CH4', ''], dtype=object)
    
    # In [561]: pd.Series(np.dot(df2.index, df2)).value_counts()
    # Out[561]: 
    # CH3CH4       3
                 # 3
    # CH2CH3CH4    2
    # CH1CH2       1
    # dtype: int64