I am trying to identify clusters of 1s in a 1D vector. Clusters that are separated by fewer zeros than a certain threshold should be grouped together. Say the threshold is 3: two clusters separated by fewer than 3 zeros should be treated as one large cluster. For instance, the following vector:
[0,0,0,1,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,1]
should give me three clusters (with non-zero numbers indicating cluster ID):
[0,0,0,1,1,1,1,0,0,0,2,2,2,2,2,2,0,0,0,0,3,3,3]
I've been scratching my head all day trying to use rolling() in pandas and some custom-made functions, but I can't come up with anything that works.
Another solution, which does not require .apply():
import pandas as pd
# Store the initial list in a pandas Series
ser = pd.Series([0,0,0,1,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,1])
First, label every element with the size of the run of consecutive identical values (1s or 0s) it belongs to:
grp_ser = ser.groupby((ser.diff() != 0).cumsum()).transform('size')
print(grp_ser.to_list())
# [3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 2, 2, 1, 1, 1, 1, 4, 4, 4, 4, 3, 3, 3]
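To make the grouping key less opaque, here is a small sketch that prints the intermediate run labels produced by `(ser.diff() != 0).cumsum()` before the sizes are computed. The names `run_start` and `run_id` are only illustrative and are not part of the answer above.

```python
import pandas as pd

ser = pd.Series([0,0,0,1,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,1])

# True wherever the value differs from the previous one. The first row is
# also True, because its diff is NaN and NaN != 0 evaluates to True.
run_start = ser.diff() != 0

# The cumulative sum of those flags gives every run its own integer label.
run_id = run_start.cumsum()
print(run_id.to_list())
# [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 9, 9, 9, 9, 10, 10, 10]
```

Grouping by `run_id` and transforming with `'size'` then broadcasts each run's length back onto its members, which is what `grp_ser` holds.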
Working on a copy of the original series, set to 1 every 0 that belongs to a run shorter than 3:
ser_copy = ser.copy()
ser_copy.loc[(grp_ser < 3) & ser.eq(0)] = 1
print(ser_copy.to_list())
# [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
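One caveat that may be worth checking: the mask only looks at run length, so a short run of zeros at the very start or end of the series is filled as well, even though it does not sit between two clusters. The sketch below uses a made-up input to show this; the names `edge`, `edge_sizes` and `edge_filled` are illustrative only.

```python
import pandas as pd

# Hypothetical input whose leading run of zeros is shorter than 3.
edge = pd.Series([0, 0, 1, 1, 0, 0, 0, 1])
edge_sizes = edge.groupby((edge.diff() != 0).cumsum()).transform('size')

# Same fill rule as above: zeros in runs shorter than 3 become 1.
edge_filled = edge.copy()
edge_filled.loc[(edge_sizes < 3) & edge.eq(0)] = 1
print(edge_filled.to_list())
# [1, 1, 1, 1, 0, 0, 0, 1]
```

Whether that is acceptable depends on the data. If leading or trailing zeros must stay 0, the mask would need an extra condition that excludes the first and last runs.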
At this point, all clusters have been identified and only need to be numbered consecutively.
Create a running sum that increments by 1 wherever a 0 turns into a 1:
res = ((ser_copy.diff() != 0) & (ser_copy != 0)).cumsum()
print(res.to_list())
# [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3]
Set the positions that are 0 in ser_copy back to 0, since the running sum carried the preceding cluster ID across them:
res[ser_copy == 0] = 0
print(res.to_list())
# [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 3, 3, 3]
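For reuse, the steps can be wrapped in a single function. This is only a sketch: the name `label_clusters` and the parameter `max_gap` are hypothetical, with `max_gap=2` corresponding to "fewer than 3 zeros". It also assumes that runs of zeros touching either end of the input should never be filled, which is one reading of the question rather than something stated in it.

```python
import pandas as pd

def label_clusters(values, max_gap=2):
    """Label runs of 1s, merging runs separated by at most `max_gap` zeros."""
    if len(values) == 0:
        return []
    ser = pd.Series(values)

    # Give every run of identical values its own label, then get run lengths.
    run_id = (ser.diff() != 0).cumsum()
    run_size = ser.groupby(run_id).transform('size')

    # Assumption: the first and last runs are not gaps between clusters.
    interior = (run_id != run_id.iloc[0]) & (run_id != run_id.iloc[-1])

    # Fill short interior runs of zeros so neighbouring clusters merge.
    filled = ser.copy()
    filled[ser.eq(0) & (run_size <= max_gap) & interior] = 1

    # Number the merged clusters, then restore the remaining zeros.
    labels = ((filled.diff() != 0) & filled.ne(0)).cumsum()
    labels[filled == 0] = 0
    return labels.to_list()

print(label_clusters([0,0,0,1,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,1]))
# [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 3, 3, 3]
```

Raising `max_gap` merges clusters across longer gaps; with `max_gap=0` no zeros are filled and every run of 1s gets its own ID.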