python pandas cluster-analysis rolling-computation

Clustering 1D vector with window

I am trying to identify clusters of 1s in a 1D vector. The catch is that clusters separated by a run of zeros shorter than a certain threshold should be grouped together. Say, with a threshold of 3, two clusters separated by fewer than 3 zeros should be treated as one large cluster. For instance, the following vector:

[0,0,0,1,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,1]

should give me three clusters (with non-zero numbers indicating cluster ID):

[0,0,0,1,1,1,1,0,0,0,2,2,2,2,2,2,0,0,0,0,3,3,3]

I've been scratching my head all day trying to use rolling() in pandas and some custom-made functions, but I can't come up with anything that works.

Solution

Another solution, not requiring the use of .apply():

import pandas as pd

# Store the initial list in a pandas Series
ser = pd.Series([0,0,0,1,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,1])

First, label each element with the size of the consecutive run of 1's or 0's it belongs to:

grp_ser = ser.groupby((ser.diff() != 0).cumsum()).transform('size')
print(grp_ser.to_list())
# [3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 2, 2, 1, 1, 1, 1, 4, 4, 4, 4, 3, 3, 3]
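The grouping key in that statement is easier to follow when computed on its own. A minimal sketch, where `starts` and `run_id` are illustrative names that do not appear in the answer:

```python
import pandas as pd

ser = pd.Series([0,0,0,1,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,1])

# True wherever the value differs from the previous one. The first row is
# also True, because its diff is NaN and NaN != 0 evaluates to True.
starts = ser.diff() != 0

# Running count of run starts: every element gets the ID of its run
run_id = starts.cumsum()
print(run_id.to_list())
# [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 9, 9, 9, 9, 10, 10, 10]

# Grouping by the run ID and broadcasting each group's size gives grp_ser
run_size = ser.groupby(run_id).transform('size')
```

Passing `run_id` to `groupby` is what keeps separate runs of the same value apart; grouping by the values themselves would lump all 0's together and all 1's together.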

Using a copy of the original series, set the value to 1 in rows where the original value is 0 and the run it belongs to is shorter than 3:

ser_copy = ser.copy()
ser_copy.loc[(grp_ser < 3) & ser.eq(0)] = 1
print(ser_copy.to_list())
# [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
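One edge case to be aware of: the mask also matches a short run of zeros at the very start or end of the series, even though such a run does not separate two clusters. The question does not say how those should be handled. If they should stay 0, one possible variant (a sketch only; the input vector and the names `run_id`, `interior`, `filled_all` and `filled_inner` are made up for illustration) restricts the fill to runs that are neither the first nor the last:

```python
import pandas as pd

# Hypothetical input with a short run of zeros at both ends
ser = pd.Series([0, 1, 1, 0, 0, 1, 0])

run_id = (ser.diff() != 0).cumsum()
run_size = ser.groupby(run_id).transform('size')

# Original mask: the leading and trailing zeros are filled as well
filled_all = ser.copy()
filled_all.loc[(run_size < 3) & ser.eq(0)] = 1
print(filled_all.to_list())
# [1, 1, 1, 1, 1, 1, 1]

# Variant: only fill runs that have a neighbouring run on both sides
interior = (run_id > run_id.iloc[0]) & (run_id < run_id.iloc[-1])
filled_inner = ser.copy()
filled_inner.loc[(run_size < 3) & ser.eq(0) & interior] = 1
print(filled_inner.to_list())
# [0, 1, 1, 1, 1, 1, 0]
```

For the vector in the question both versions give the same result, since its only edge run of zeros has length 3.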

At this point, all clusters have been identified and only need to be numbered consecutively.

Create a running sum that increments by 1 wherever a 0 turns into a 1:

res = ((ser_copy.diff() != 0) & (ser_copy != 0)).cumsum()
print(res.to_list())
# [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3]

Reset the positions that are 0 in ser_copy back to 0, since the running sum carried the previous cluster ID across the gaps:

res[ser_copy == 0] = 0
print(res.to_list())
# [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 3, 3, 3]
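For reuse, the steps above can be collected into one function with the threshold as a parameter. This is a sketch rather than part of the original answer; `label_clusters` and `max_gap` are hypothetical names:

```python
import pandas as pd

def label_clusters(values, max_gap=3):
    """Label runs of 1s, merging runs separated by fewer than max_gap zeros.

    label_clusters and max_gap are illustrative names, not from the answer.
    """
    ser = pd.Series(values)

    # Size of the run of identical values that each element belongs to
    run_size = ser.groupby((ser.diff() != 0).cumsum()).transform('size')

    # Fill runs of zeros that are shorter than max_gap
    filled = ser.copy()
    filled.loc[(run_size < max_gap) & ser.eq(0)] = 1

    # Count cluster starts, then reset the remaining zeros
    res = ((filled.diff() != 0) & (filled != 0)).cumsum()
    res[filled == 0] = 0
    return res.to_list()

vec = [0,0,0,1,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,1]
print(label_clusters(vec))
# [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 3, 3, 3]
print(label_clusters(vec, max_gap=1))
# [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 2, 2, 0, 3, 0, 4, 0, 0, 0, 0, 5, 5, 5]
```

With `max_gap=1` no gap is ever filled, so every run of 1's gets its own ID.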