python pandas cluster-analysis rolling-computation

Clustering 1D vector with window


I am trying to identify clusters of 1s in a 1D vector. The difficulty is that clusters separated by fewer zeros than a given threshold should be grouped together. Say the threshold is 3: two clusters separated by fewer than 3 zeros should count as one large cluster. For instance, the following vector:

[0,0,0,1,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,1]

should give me three clusters (with non-zero numbers indicating cluster ID):

[0,0,0,1,1,1,1,0,0,0,2,2,2,2,2,2,0,0,0,0,3,3,3]

I've been scratching my head all day trying to use rolling() in pandas and some custom functions, but can't come up with anything that works.


Solution

  • Another solution, not requiring the use of .apply():

    import pandas as pd
    
    # Store the initial list in a pandas Series
    ser = pd.Series([0,0,0,1,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,1])
    

    First, label every element with the size of the run of consecutive identical values (1's or 0's) it belongs to:

    grp_ser = ser.groupby((ser.diff() != 0).cumsum()).transform('size')
    print(grp_ser.to_list())
    # [3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 2, 2, 1, 1, 1, 1, 4, 4, 4, 4, 3, 3, 3]
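
The one-liner above packs three steps together. A minimal sketch that splits them apart on a shorter, made-up vector may make the intermediate values easier to follow; the names `starts`, `run_id` and `run_len` are illustrative and not part of the original answer:

```python
import pandas as pd

# Short illustrative vector (not the one from the question)
ser = pd.Series([0, 0, 0, 1, 1, 0, 1])

# True wherever the value differs from the previous one. The first row is
# also True, because diff() yields NaN there and NaN != 0 evaluates to True.
starts = ser.diff() != 0

# The cumulative sum of those boundaries gives every run its own integer ID.
run_id = starts.cumsum()
print(run_id.to_list())
# [1, 1, 1, 2, 2, 3, 4]

# transform('size') broadcasts each run's length back onto its rows.
run_len = ser.groupby(run_id).transform('size')
print(run_len.to_list())
# [3, 3, 3, 2, 2, 1, 1]
```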
    

    Using a copy of the original series, set the value to 1 in rows where the original value is 0 and the group size is less than 3:

    ser_copy = ser.copy()
    ser_copy.loc[(grp_ser < 3) & ser.eq(0)] = 1
    print(ser_copy.to_list())
    # [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
    

    At this point, all clusters have been identified; they only need to be numbered consecutively.

    Create a running sum that increments by 1 wherever a 0 is followed by a 1:

    res = ((ser_copy.diff() != 0) & (ser_copy != 0)).cumsum()
    print(res.to_list())
    # [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3]
    

    Reset to 0 the positions that are 0 in ser_copy, because the running sum carried the previous cluster ID across them:

    res[ser_copy == 0] = 0
    print(res.to_list())
    # [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 3, 3, 3]
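
For reuse, the steps above can be collected into one function. This is only a sketch: the function name `label_clusters` and the parameter `min_gap` are hypothetical. It also makes one assumption the question does not settle: a short run of zeros at the very start or end of the vector does not separate two clusters, so it is left as 0 here. The step-by-step code above would instead turn such a run into part of a cluster.

```python
import pandas as pd


def label_clusters(values, min_gap=3):
    """Number clusters of 1s, merging clusters separated by fewer than
    `min_gap` zeros. Hypothetical helper; expects a sequence of 0s and 1s."""
    if len(values) == 0:
        return []

    ser = pd.Series(values)

    # Give every run of identical values an ID, then look up its length.
    run_id = (ser.diff() != 0).cumsum()
    run_len = ser.groupby(run_id).transform('size')

    # Assumption: only zero runs that sit between two other runs can be gaps.
    # The first and last runs of the vector are never filled in.
    interior = (run_id > run_id.iloc[0]) & (run_id < run_id.iloc[-1])

    # Fill short interior gaps with 1 so neighbouring clusters merge.
    filled = ser.copy()
    filled[(run_len < min_gap) & ser.eq(0) & interior] = 1

    # Increment a counter at every 0 -> 1 transition, then blank out the zeros.
    res = ((filled.diff() != 0) & filled.ne(0)).cumsum()
    res[filled == 0] = 0
    return res.to_list()


print(label_clusters([0,0,0,1,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,1]))
# [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 3, 3, 3]
```

If leading or trailing short runs of zeros should be absorbed into the neighbouring cluster after all, dropping the `interior` condition restores the behaviour of the step-by-step version.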