
Identifying the number of clusters in a Python DataFrame


Picture showing the clusters I wish to count

I am looking to identify the number of clusters of non-zeros in my DataFrame.

Here I have a DataFrame with four (4) clusters in total, but I am having trouble finding code that can count them for me.

import pandas as pd

data = [
    [0,0,0,255,255,255,0,0],
    [0,255,0,255,255,255,0,0],
    [0,0,0,255,255,255,0,0],
    [0,0,0,0,255,0,0,0],
    [0,255,255,0,0,255,0,0],
    [0,255,0,0,0,255,0,0],
    [0,0,0,0,0,255,0,0],
    [0,0,0,0,0,255,0,0]
]

df2 = pd.DataFrame(data)

Any help is appreciated!


Solution

  • I searched a bit myself and came up with this. It was a bit of trial and error without much background knowledge, but I varied the number of groups in your data a few times and skimage.measure.label always gave the right result:

    import numpy as np
    from skimage import measure
    
    data = [
        [0, 0, 0, 255, 255, 255, 0, 0],
        [0, 255, 0, 255, 255, 255, 0, 0],
        [0, 0, 0, 255, 255, 255, 0, 0],
        [0, 0, 0, 0, 255, 0, 0, 0],
        [0, 255, 255, 0, 0, 255, 0, 0],
        [0, 255, 0, 0, 0, 255, 0, 0],
        [0, 0, 0, 0, 0, 255, 0, 0],
        [0, 0, 0, 0, 0, 255, 0, 0]
    ]
    arr = np.array(data)
    # Label connected groups of 255s; connectivity=1 ignores diagonal neighbours
    groups, group_count = measure.label(arr == 255, return_num=True, connectivity=1)
    
    print('Groups: \n', groups)
    print(f'Number of groups: {group_count}')
    
    Output:
    
    Groups:
    [[0 0 0 1 1 1 0 0]
     [0 2 0 1 1 1 0 0]
     [0 0 0 1 1 1 0 0]
     [0 0 0 0 1 0 0 0]
     [0 3 3 0 0 4 0 0]
     [0 3 0 0 0 4 0 0]
     [0 0 0 0 0 4 0 0]
     [0 0 0 0 0 4 0 0]]
    Number of groups: 4
    

    The first argument to measure.label is a boolean mask marking the cells that belong to a cluster. In your case arr == 255 works, or simply arr > 0 if the non-zero values are not always 255. connectivity needs to be set to 1 because you don't want cells that touch only diagonally to count as connected (if you do, set it to 2). With return_num=True the result is a tuple whose second element is the number of clusters.
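
To see why the connectivity setting changes the count, here is a minimal pure-Python sketch of the same idea: a breadth-first flood fill over the grid. It is only an illustration and not how skimage implements measure.label. The function name count_clusters and its connectivity parameter are made up for this example, with 1 meaning edge neighbours only and 2 also allowing diagonals.

```python
from collections import deque

def count_clusters(grid, connectivity=1):
    """Count groups of non-zero cells (hypothetical helper, not part of skimage).

    connectivity=1 joins cells that share an edge;
    connectivity=2 also joins cells that touch diagonally.
    """
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if connectivity == 2:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]

    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0 or seen[r][c]:
                continue
            # Unvisited non-zero cell: start a new cluster and flood-fill it
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in steps:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] != 0 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count

data = [
    [0, 0, 0, 255, 255, 255, 0, 0],
    [0, 255, 0, 255, 255, 255, 0, 0],
    [0, 0, 0, 255, 255, 255, 0, 0],
    [0, 0, 0, 0, 255, 0, 0, 0],
    [0, 255, 255, 0, 0, 255, 0, 0],
    [0, 255, 0, 0, 0, 255, 0, 0],
    [0, 0, 0, 0, 0, 255, 0, 0],
    [0, 0, 0, 0, 0, 255, 0, 0],
]

print(count_clusters(data, connectivity=1))  # 4
print(count_clusters(data, connectivity=2))  # 3
```

With diagonals allowed, the cell at row 3, column 4 touches the cell at row 4, column 5, so the big block at the top and the vertical strip on the right merge into one cluster and the count drops from 4 to 3. If your data is in a DataFrame rather than a list, df2.to_numpy() > 0 should give a mask you can pass to measure.label directly.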