I have this dataframe:
df = pd.DataFrame({'forms_a_cluster': [False, False, True, True, True, False, False, False,
True, True, False, True, True, True, False],
'cluster_number':[False, False, 1, 1, 1, False, False, False,
2, 2, False, 3, 3, 3, False]})
The idea is that I have some criteria which, when certain rows have met it, selects those cases as True, and when consecutive rows meet the criteria, they then form a cluster. I want to be able to label each cluster as cluster_1
, cluster_2
, cluster_3
etc. I've given an example of the hoped for output with the column cluster_number
. But I have no idea how to do this, given that in the real data, I have to do it many times on different datasets which have a different number of rows and the cluster sizes will be different every time. Do you have any idea how to go about this?
You can use a groupby.ngroup
on the groups of successive values pre-filtered to the True
:
# group by successive values
m = df['forms_a_cluster'].ne(df['forms_a_cluster'].shift()).cumsum()
# filter groups of True, add group number
# fill values with False
df['cluster_number'] = (m[df['forms_a_cluster']]
.groupby(m).ngroup().add(1)
.reindex(df.index, fill_value=False)
)
Or with arithmetics:
m = df['forms_a_cluster'].ne(df['forms_a_cluster'].shift()).cumsum()
df['cluster_number'] = (df['forms_a_cluster']
.mask(df['forms_a_cluster'],
m//2 + df['forms_a_cluster'].iloc[0])
)
Output:
forms_a_cluster cluster_number
0 False False
1 False False
2 True 1
3 True 1
4 True 1
5 False False
6 False False
7 False False
8 True 2
9 True 2
10 False False
11 True 3
12 True 3
13 True 3
14 False False
Other example:
forms_a_cluster cluster_number
0 True 1
1 True 1
2 False False
3 True 2
4 True 2
5 False False
6 False False
7 False False
8 True 3
9 True 3
10 False False
11 True 4
12 True 4
13 True 4
14 False False