I have a DataFrame whose columns contain runs of 1's and -1's. Each column can contain any number of 1's or -1's grouped together, and a column may also begin with an unspecified number of NaN's. I would like to count the number of values in each group, excluding the NaN's.
An example dataframe is below:
import pandas as pd

NaN = float('nan')
df = pd.DataFrame({'col1': [1,1,1,1,-1,-1,-1,1,1],
                   'col2': [NaN, NaN, NaN, 1,1,-1,-1,-1,-1],
                   'col3': [NaN,NaN,NaN,NaN,NaN,1,1,1,-1]})
I would like the output to be another DataFrame, as shown below:
df = pd.DataFrame({'col1': [4,4,4,4,3,3,3,2,2],
                   'col2': [NaN, NaN, NaN, 2,2,4,4,4,4],
                   'col3': [NaN,NaN,NaN,NaN,NaN,3,3,3,1]})
Assuming a DataFrame/Series, you can group the successive values and use a groupby.transform('size'):
df = pd.DataFrame({'col': [1,1,1,1,1,-1,-1,-1,-1,1,1,1]})

# each run of identical values gets a unique id via ne/shift/cumsum
df['out'] = (df.groupby(df['col'].ne(df['col'].shift()).cumsum())
               .transform('size')
             )
Output:
col out
0 1 5
1 1 5
2 1 5
3 1 5
4 1 5
5 -1 4
6 -1 4
7 -1 4
8 -1 4
9 1 3
10 1 3
11 1 3
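To see why this works, the grouper `df['col'].ne(df['col'].shift()).cumsum()` labels each run of identical values with its own integer id, which `groupby` then uses to compute the run sizes. A quick sketch of the intermediate labels:

```python
import pandas as pd

df = pd.DataFrame({'col': [1,1,1,1,1,-1,-1,-1,-1,1,1,1]})

# ne(shift()) is True at every position where the value changes;
# cumsum turns those change points into consecutive run ids
groups = df['col'].ne(df['col'].shift()).cumsum()
print(groups.tolist())  # [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
```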
If your input is a list, there is no need for pandas; go with itertools.groupby/itertools.chain:
from itertools import groupby, chain

lst = [1,1,1,1,1,-1,-1,-1,-1,1,1,1]

# for each run, repeat its length as many times as the run is long
# (the walrus operator `:=` requires Python >= 3.8)
out = list(chain.from_iterable([l:=len(list(g))]*l for _, g in groupby(lst)))
Output:
[5, 5, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3]
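If the walrus operator is not available (Python < 3.8), the same logic can be written as a plain loop; a minimal equivalent sketch:

```python
from itertools import groupby

lst = [1,1,1,1,1,-1,-1,-1,-1,1,1,1]

out = []
for _, g in groupby(lst):
    run = list(g)                    # materialize the run to measure it
    out.extend([len(run)] * len(run))
print(out)  # [5, 5, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3]
```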
If you have a DataFrame with multiple columns, you can transform each column independently with apply and mask the NaNs:
NaN = float('nan')
df = pd.DataFrame({'col1': [1,1,1,1,-1,-1,-1,1,1],
                   'col2': [NaN, NaN, NaN, 1,1,-1,-1,-1,-1],
                   'col3': [NaN,NaN,NaN,NaN,NaN,1,1,1,-1]})
out = df.apply(lambda s: s.groupby(s.ne(s.shift()).cumsum())
                          .transform('size')
                          .mask(s.isna()))
Output:
col1 col2 col3
0 4 NaN NaN
1 4 NaN NaN
2 4 NaN NaN
3 4 2.0 NaN
4 3 2.0 NaN
5 3 4.0 3.0
6 3 4.0 3.0
7 2 4.0 3.0
8 2 4.0 1.0
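As a sanity check, the result can be compared against the desired DataFrame from the question. `check_dtype=False` is used because `transform('size')` yields integers for the NaN-free column while the expected frame could be built with either dtype:

```python
import pandas as pd

NaN = float('nan')
df = pd.DataFrame({'col1': [1,1,1,1,-1,-1,-1,1,1],
                   'col2': [NaN, NaN, NaN, 1,1,-1,-1,-1,-1],
                   'col3': [NaN,NaN,NaN,NaN,NaN,1,1,1,-1]})

out = df.apply(lambda s: s.groupby(s.ne(s.shift()).cumsum())
                          .transform('size')
                          .mask(s.isna()))

expected = pd.DataFrame({'col1': [4,4,4,4,3,3,3,2,2],
                         'col2': [NaN, NaN, NaN, 2,2,4,4,4,4],
                         'col3': [NaN,NaN,NaN,NaN,NaN,3,3,3,1]})

# raises AssertionError if the frames differ
pd.testing.assert_frame_equal(out, expected, check_dtype=False)
```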