I'm trying to agglomerate adjacent cells (and their neighbours) that have the same type (an integer from 1 to 10) into new clusters by assigning them a cluster id.
As visualised here for some of the clusters:
Currently, I use an adaptation of breadth-first search to walk through all neighbours and their neighbours, and then assign a cluster id to all found neighbours of the same type. Since my data set is fairly big (close to 3 million grid cells), this operation would take days in its current form.
This is the function that I run in a loop, choosing start cells at random until all cells have been assigned a cluster id:
```python
def bfs(df, grid_id):
    visited = []
    queue = []
    visited.append(grid_id)
    queue.append(grid_id)
    base_pred = df.loc[grid_id, "class"]
    while queue:
        grid_id = queue.pop(0)  # take the next cell off the front of the queue
        n, e = get_n_e_index(grid_id)
        neighbours = get_same_neighbours(base_pred, n, e, agglo_df)
        for neighbour in neighbours:
            if neighbour not in visited:
                visited.append(neighbour)
                queue.append(neighbour)
                queue = queue + get_same_neighbours(base_pred, n, e, agglo_df)
    return visited
```
`visited` is used to assign the same cluster_id.
`get_same_neighbours` just returns neighbours of a given cell that have the same type.
Before I got around to validating the approach, I came to the conclusion that it's ultimately too slow, and I was wondering if anybody knows of an algorithm fast enough to solve this. Searching online for a solution didn't help.
Edit: That it took so long wasn't an issue with the algorithm itself, but with me forgetting to reduce the possible start cells to only those that haven't been assigned yet; roughly like the sketch below.
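A simplified sketch of the fixed outer loop (the `cluster_id` column and the exact assignment code are stand-ins for my actual bookkeeping):

```python
import random

# Only cells that haven't been assigned yet remain candidates
# for the next BFS start cell.
unassigned = set(agglo_df["ids"].values)
cluster_id = 0

while unassigned:
    start = random.choice(tuple(unassigned))
    cluster = bfs(agglo_df, start)
    agglo_df.loc[cluster, "cluster_id"] = cluster_id  # label the whole cluster
    unassigned -= set(cluster)  # never pick these as a start again
    cluster_id += 1
```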
Edit 2: `get_n_e_index` just converts a string ID to north/east integers.
```python
def get_same_neighbours(base_class, north, east, df):
    neighbours = []
    grid_ids = df["ids"].values
    # invalid (j, i) combinations: the four diagonals and the cell itself,
    # so only the four direct neighbours are kept
    invalid = [(-1, 1), (1, 1), (0, 0), (1, -1), (-1, -1)]
    for i in [-1, 1, 0]:
        for j in [-1, 1, 0]:
            n = int(north + i)
            e = int(east + j)
            grid_id = get_grid_id(n, e)
            if grid_id in grid_ids and (j, i) not in invalid and base_class == df.loc[grid_id, "class"]:
                neighbours.append(grid_id)
    return neighbours
```
`get_grid_id` converts north (`n`) and east (`e`) integers to the string id of the grid cell.
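For illustration only, the two converters could look something like this, assuming a hypothetical ID format such as `"N123E456"` (the actual format doesn't matter for the question):

```python
def get_grid_id(n, e):
    # hypothetical ID format: "N<north>E<east>", e.g. "N123E456"
    return f"N{n}E{e}"

def get_n_e_index(grid_id):
    # inverse of get_grid_id for the hypothetical format above
    north, east = grid_id[1:].split("E")
    return int(north), int(east)
```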
You're making a couple of mistakes. Note that for ~3 million cells, this should work within about a second.
As pointed out in the comments, `visited` should be a set for O(1) lookup.
Second, `queue = queue + get_same_neighbours(base_pred, n, e, agglo_df)` can get very slow if the queue gets very long, as it creates a copy of the whole queue every time. Instead, simply `append` the new items.
Finally, your `queue` is not actually a queue but a plain list, so popping the first element takes O(n) time. Instead you can use a real queue (`collections.deque` in Python), or you can pop the last element instead, which turns breadth-first search into depth-first search and should work just as well.
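Putting these three fixes together, a minimal sketch could look like this (keeping your helpers `get_n_e_index`/`get_same_neighbours` and the global `agglo_df` as in the question; note that `visited` is now returned as a set, which is still fine for assigning cluster ids):

```python
from collections import deque

def bfs(df, grid_id):
    visited = {grid_id}       # set: O(1) membership tests
    queue = deque([grid_id])  # deque: O(1) pops from the front
    base_pred = df.loc[grid_id, "class"]
    while queue:
        grid_id = queue.popleft()
        n, e = get_n_e_index(grid_id)
        for neighbour in get_same_neighbours(base_pred, n, e, agglo_df):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)  # append only; no queue copying
    return visited
```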
Apart from this, I think the only remaining problems are in the graph/grid representation. Storing it as a dataframe means that finding a cell tends to require scanning the entire dataframe. If you store it as a multidimensional array (nested `list`s or a numpy `array`) or even as a `dict`, you can find any cell in O(1) time. Therefore I'd suggest first preprocessing the data from the dataframe into another data structure that grants fast access.
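For example, you could build a plain `dict` keyed by `(north, east)` once and run all neighbour lookups against it. A rough sketch, assuming (as your use of `df.loc` suggests) that the dataframe's index matches its `ids` column; `get_same_neighbours_fast` is just an illustrative name:

```python
# One-time preprocessing: map (north, east) -> class for O(1) lookups.
grid = {
    get_n_e_index(grid_id): agglo_df.loc[grid_id, "class"]
    for grid_id in agglo_df["ids"].values
}

def get_same_neighbours_fast(base_class, north, east, grid):
    # Same four-neighbour logic as in the question, but each check
    # is now a dict access instead of a scan over the dataframe.
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return [
        get_grid_id(north + i, east + j)
        for i, j in offsets
        if grid.get((north + i, east + j)) == base_class
    ]
```

With that, every neighbour check is O(1) and the whole labelling pass becomes linear in the number of cells.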