I am dealing with a multi-indexed dataframe that looks like this:
(sorry for writing null instead of NaN)
What could be the most efficient way to find occurrences of the patterns i highlighted?
I expect to reach a result like this one:
Thanks in advance for any insight!
For who wants to play with it:
from io import StringIO
import pandas as pd
df1_text = """ A B C
STAND1 CH1 NaN NaN NaN
STAND1 CH2 NaN 11.2 NaN
STAND1 CH3 12.4 7.0 NaN
STAND1 CH4 10.2 2.0 NaN
STAND2 CH1 NaN 2.5 NaN
STAND2 CH2 NaN 11.2 NaN
STAND2 CH3 NaN NaN 6.3
STAND2 CH4 NaN NaN 23.5
STAND3 CH1 NaN NaN NaN
STAND3 CH2 12.3 NaN NaN
STAND3 CH3 5.3 4.5 NaN
STAND3 CH4 7.2 25.6 NaN"""
df1 = pd.read_csv(StringIO(df1_text), delim_whitespace=True)
Here is one approach. In short, you can use
df2 = df.swaplevel(0,1).unstack().notnull()
print(pd.Series(np.dot(df2.index, df2)).value_counts())
The first line creates df2
that lines up the channel column with 9 columns of boolean indicators of cells that are not null, e.g.
# A B C
# STAND1 STAND2 STAND3 STAND1 STAND2 STAND3 STAND1 STAND2 STAND3
# CH1 False False False False True False False False False
# CH2 False False True True True False False False False
# CH3 True False True True False True False True False
# CH4 True False True True False True False True False
The goal of the second step is to replace each column in df2
with a string representing an event. Using the fact that Python strings can be multiplied by integers, we get
np.dot([CH1, CH2, CH3, CH4], [True, True, False, False]) <==>
'CH1' * True + 'CH2' * True + 'CH3' * False + 'CH4' * False <==>
'CH1' * 1 + 'CH2' * 1 + 'CH3' * 0 + 'CH4' * 0 <==>
'CH1' + 'CH2' <==>
'CH1CH2'
This has a cosmetic defect of omitting commas and including an "empty" event.
Full example:
from io import StringIO
import pandas as pd
df1_text = """ A B C
STAND1 CH1 NaN NaN NaN
STAND1 CH2 NaN 11.2 NaN
STAND1 CH3 12.4 7.0 NaN
STAND1 CH4 10.2 2.0 NaN
STAND2 CH1 NaN 2.5 NaN
STAND2 CH2 NaN 11.2 NaN
STAND2 CH3 NaN NaN 6.3
STAND2 CH4 NaN NaN 23.5
STAND3 CH1 NaN NaN NaN
STAND3 CH2 12.3 NaN NaN
STAND3 CH3 5.3 4.5 NaN
STAND3 CH4 7.2 25.6 NaN"""
df1 = pd.read_csv(StringIO(df1_text), delim_whitespace=True)
# solution
df2 = df.swaplevel(0,1).unstack().notnull()
print(pd.Series(np.dot(df2.index, df2)).value_counts())
# In [559]: df.swaplevel(0,1).unstack().notnull()
# Out[559]:
# A B C
# STAND1 STAND2 STAND3 STAND1 STAND2 STAND3 STAND1 STAND2 STAND3
# CH1 False False False False True False False False False
# CH2 False False True True True False False False False
# CH3 True False True True False True False True False
# CH4 True False True True False True False True False
# In [560]: np.dot(df2.index, df2)
# Out[560]:
# array(['CH3CH4', '', 'CH2CH3CH4', 'CH2CH3CH4', 'CH1CH2', 'CH3CH4', '',
# 'CH3CH4', ''], dtype=object)
# In [561]: pd.Series(np.dot(df2.index, df2)).value_counts()
# Out[561]:
# CH3CH4 3
# 3
# CH2CH3CH4 2
# CH1CH2 1
# dtype: int64