python pandas numpy scipy

Break up a sparse 2D array or table into multiple subarrays or subtables


I want to find a way to "lasso around" a bunch of contiguous/touching values in a sparse table, and output a set of new tables. If any values are "touching", they should be part of a subarray together.

For example: if I have the following sparse table/array:

[[0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0]
 [0 0 0 1 1 0 0 0 1 1 1 1 1 0 0 1 1 0 0]
 [0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0]]

The algorithm should "find" subtables/subarrays. It would identify them like this:

[[0 0 0 1 1 0 0 0 2 2 2 2 0 0 0 0 0 0 0]
 [0 0 0 1 1 0 0 0 2 2 2 2 2 0 0 3 3 0 0]
 [0 0 0 0 0 0 0 0 2 2 2 0 0 0 3 3 3 3 0]
 [0 0 0 0 0 0 0 2 0 0 0 0 0 3 0 0 0 0 0]]

But the final output should be a series of subarrays/subtables like this:

[[1 1]
 [1 1]]
[[0 1 1 1 1 0]
 [0 1 1 1 1 1]
 [0 1 1 1 0 0]
 [1 0 0 0 0 0]]
[[0 0 1 1 0]
 [0 1 1 1 1]
 [1 0 0 0 0]]

How can I do this in Python? I've tried looking at scikit-image, and a few things seem similar to what I'm trying to do, but nothing I have seen fits quite right.

EDIT: It looks like scipy.ndimage.label is extremely close to what I want, but by default it breaks the corner-case values (cells that touch only diagonally) into their own separate arrays, so it's not quite right.

EDIT: Aha, the structure argument is what I am after. If I get time I will update my question with an answer.
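For reference, here is a minimal sketch of the scipy.ndimage.label route mentioned in the edit. It assumes a 3x3 all-ones structure (so diagonal neighbors count as touching) and uses scipy.ndimage.find_objects to get one bounding-box slice per label. The variable names are illustrative, not from the original post.

```python
import numpy as np
from scipy import ndimage

# Example array from the question
a = np.array([
    [0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
])

# A 3x3 block of ones makes diagonal neighbors count as connected.
# The default structure only connects cells that share an edge.
structure = np.ones((3, 3), dtype=int)
labels, n = ndimage.label(a, structure=structure)

# find_objects returns one tuple of slices (a bounding box) per label
slices = ndimage.find_objects(labels)
subarrays = [a[s] for s in slices]

print(n)  # number of components found
for sub in subarrays:
    print(sub)
```

With the default structure, the two cells in the bottom row that touch their neighbors only diagonally would each become a separate component, which is the behavior described in the first edit.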


Solution

  • A possible solution, based on the following ideas:

    • First, measure.label (from skimage) assigns a unique label to each connected component in the array, using 8-connectivity (connectivity=2), so cells that touch only diagonally are grouped together.

    • Second, measure.regionprops retrieves properties of these labeled regions, such as their bounding boxes.

    • Then, the code iterates through each detected region, unpacks the region's bounding box as (min_row, min_col, max_row, max_col), where the max values are exclusive, and slices the original array a to obtain the corresponding subarray.

    
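    from skimage import measure

    # a is the 2D NumPy array from the question
    # connectivity=2 treats diagonal neighbors as connected
    labels = measure.label(a, connectivity=2)
    regions = measure.regionprops(labels)
    
    list_suba = []
    for region in regions:
        # bbox is (min_row, min_col, max_row, max_col); max values are exclusive
        min_row, min_col, max_row, max_col = region.bbox
        subarray = a[min_row:max_row, min_col:max_col]
        list_suba.append(subarray)
    
    list_suba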
    

    Or, more concisely:

    labels = measure.label(a, connectivity=2)
    regions = measure.regionprops(labels)
    
    [a[region.bbox[0]:region.bbox[2], region.bbox[1]:region.bbox[3]] 
     for region in regions]
    

    Output:

    [array([[1, 1],
            [1, 1]]),
     array([[0, 1, 1, 1, 1, 0],
            [0, 1, 1, 1, 1, 1],
            [0, 1, 1, 1, 0, 0],
            [1, 0, 0, 0, 0, 0]]),
     array([[0, 0, 1, 1, 0],
            [0, 1, 1, 1, 1],
            [1, 0, 0, 0, 0]])]
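One caveat with slicing by bounding box: if cells from a different component happen to fall inside a region's box, they show up in that subarray too. This does not happen for the example array, but it can for other inputs. Below is a hedged sketch of one way to zero out such cells. It uses scipy.ndimage rather than skimage, and both the helper name split_components and its mask_others parameter are hypothetical, introduced only for illustration.

```python
import numpy as np
from scipy import ndimage

def split_components(arr, mask_others=True):
    """Return one subarray per 8-connected component of nonzero cells.

    Hypothetical helper: when mask_others is True, cells inside a
    bounding box that belong to a different component are set to 0.
    """
    labels, _ = ndimage.label(arr, structure=np.ones((3, 3), dtype=int))
    out = []
    # find_objects yields the bounding box for label 1, 2, ... in order
    for label_id, box in enumerate(ndimage.find_objects(labels), start=1):
        sub = arr[box]
        if mask_others:
            sub = np.where(labels[box] == label_id, sub, 0)
        out.append(sub)
    return out

# Small example where one bounding box contains a cell of another component:
# the lone cell at row 2, column 0 sits inside the box of the larger shape.
b = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
])

masked = split_components(b)
raw = split_components(b, mask_others=False)
```

Whether masking is wanted depends on the use case: the unmasked version reproduces the original table region exactly, while the masked version isolates a single component.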