Search code examples
pythonpandasdataframegroup-by

pandas: add a grouping variable to clusters of rows that meet a criteria


I have this dataframe:

df = pd.DataFrame({'forms_a_cluster': [False, False, True, True, True, False, False, False,
True, True, False, True, True, True, False],
'cluster_number':[False, False, 1, 1, 1, False, False, False,
              2, 2, False, 3, 3, 3, False]})

The idea is that I have some criteria which, when certain rows have met it, selects those cases as True, and when consecutive rows meet the criteria, they then form a cluster. I want to be able to label each cluster as cluster_1, cluster_2, cluster_3 etc. I've given an example of the hoped for output with the column cluster_number. But I have no idea how to do this, given that in the real data, I have to do it many times on different datasets which have a different number of rows and the cluster sizes will be different every time. Do you have any idea how to go about this?


Solution

  • You can use a groupby.ngroup on the groups of successive values pre-filtered to the True:

    # group by successive values
    m = df['forms_a_cluster'].ne(df['forms_a_cluster'].shift()).cumsum()
    
    # filter groups of True, add group number
    # fill values with False
    df['cluster_number'] = (m[df['forms_a_cluster']]
                            .groupby(m).ngroup().add(1)
                            .reindex(df.index, fill_value=False)
                            )
    

    Or with arithmetics:

    m = df['forms_a_cluster'].ne(df['forms_a_cluster'].shift()).cumsum()
    df['cluster_number'] = (df['forms_a_cluster']
                            .mask(df['forms_a_cluster'],
                                  m//2 + df['forms_a_cluster'].iloc[0])
                           )
    

    Output:

        forms_a_cluster cluster_number
    0             False          False
    1             False          False
    2              True              1
    3              True              1
    4              True              1
    5             False          False
    6             False          False
    7             False          False
    8              True              2
    9              True              2
    10            False          False
    11             True              3
    12             True              3
    13             True              3
    14            False          False
    

    Other example:

        forms_a_cluster cluster_number
    0              True              1
    1              True              1
    2             False          False
    3              True              2
    4              True              2
    5             False          False
    6             False          False
    7             False          False
    8              True              3
    9              True              3
    10            False          False
    11             True              4
    12             True              4
    13             True              4
    14            False          False