I'm trying to agglomerate adjacent cells (and their neighbours) that have the same type (an integer from 1 to 10) into new clusters by assigning them a cluster id.
As visualised here for some of the clusters:
Currently, I use an adaptation of breadth-first search to walk through all neighbours and their neighbours, and then assign a cluster id to all found neighbours of the same type. Since my data set is fairly big (close to 3 million grid cells), this operation would take days in its current form.
This is the function that I run in a loop, choosing start cells at random until all cells have been assigned a cluster id:
```python
def bfs(df, grid_id):
    visited = []
    queue = []
    visited.append(grid_id)
    queue.append(grid_id)
    base_pred = df.loc[grid_id, "class"]
    while queue:
        grid_id = queue.pop(0)  # take the next cell off the front of the queue
        n, e = get_n_e_index(grid_id)
        neighbours = get_same_neighbours(base_pred, n, e, agglo_df)
        for neighbour in neighbours:
            if neighbour not in visited:
                visited.append(neighbour)
                queue.append(neighbour)
                queue = queue + get_same_neighbours(base_pred, n, e, agglo_df)
    return visited
```
`visited` is used to assign the same cluster_id.
`get_same_neighbours` just returns neighbours of a given cell that have the same type.
Before I got around to validating the approach, I came to the conclusion that it's ultimately too slow, and I was wondering if anybody knows of an algorithm fast enough to solve this. Searching online for a solution didn't help.
Edit: That it took so long wasn't an issue with the algorithm itself, but with me forgetting to reduce the possible start cells to only those that haven't been assigned yet; roughly like the sketch below.
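A simplified sketch of the fixed outer loop (the `cluster_id` column and the exact assignment code are stand-ins for my actual bookkeeping):

```python
import random

# Only cells that haven't been assigned yet remain candidates
# for the next BFS start cell.
unassigned = set(agglo_df["ids"].values)
cluster_id = 0

while unassigned:
    start = random.choice(tuple(unassigned))
    cluster = bfs(agglo_df, start)
    agglo_df.loc[cluster, "cluster_id"] = cluster_id  # label the whole cluster
    unassigned -= set(cluster)  # never pick these as a start again
    cluster_id += 1
```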
Edit 2: `get_n_e_index` just converts a string ID to north/east integers.
```python
def get_same_neighbours(base_class, north, east, df):
    neighbours = []
    grid_ids = df["ids"].values
    # invalid (j, i) combinations: the four diagonals and the cell itself,
    # so only the four direct neighbours are kept
    invalid = [(-1, 1), (1, 1), (0, 0), (1, -1), (-1, -1)]
    for i in [-1, 1, 0]:
        for j in [-1, 1, 0]:
            n = int(north + i)
            e = int(east + j)
            grid_id = get_grid_id(n, e)
            if grid_id in grid_ids and (j, i) not in invalid and base_class == df.loc[grid_id, "class"]:
                neighbours.append(grid_id)
    return neighbours
```
`get_grid_id` converts north (`n`) and east (`e`) integers to the string id of the grid cell.
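For illustration only, the two converters could look something like this, assuming a hypothetical ID format such as `"N123E456"` (the actual format doesn't matter for the question):

```python
def get_grid_id(n, e):
    # hypothetical ID format: "N<north>E<east>", e.g. "N123E456"
    return f"N{n}E{e}"

def get_n_e_index(grid_id):
    # inverse of get_grid_id for the hypothetical format above
    north, east = grid_id[1:].split("E")
    return int(north), int(east)
```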
You're making a couple of mistakes. Note that for ~3 million cells, this should work within about a second.
As pointed out in the comments, `visited` should be a set for O(1) lookup.
Second, `queue = queue + get_same_neighbours(base_pred, n, e, agglo_df)` can get very slow if the queue gets very long, as it creates a copy of the whole queue every time. Instead, simply `append` the new items.
Finally, your `queue` is not actually a queue but a plain list, so popping the first element takes O(n) time. Instead you can use a real queue (`collections.deque` in Python), or you can pop the last element instead, which turns breadth-first search into depth-first search and should work just as well.
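Putting these three fixes together, a minimal sketch could look like this (keeping your helpers `get_n_e_index`/`get_same_neighbours` and the global `agglo_df` as in the question; note that `visited` is now returned as a set, which is still fine for assigning cluster ids):

```python
from collections import deque

def bfs(df, grid_id):
    visited = {grid_id}       # set: O(1) membership tests
    queue = deque([grid_id])  # deque: O(1) pops from the front
    base_pred = df.loc[grid_id, "class"]
    while queue:
        grid_id = queue.popleft()
        n, e = get_n_e_index(grid_id)
        for neighbour in get_same_neighbours(base_pred, n, e, agglo_df):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)  # append only; no queue copying
    return visited
```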
Apart from this, I think the only remaining problems are in the graph/grid representation. Storing it as a dataframe means that finding a cell tends to require scanning the entire dataframe. If you store it as a multidimensional array (nested `list`s or a numpy `array`) or even as a `dict`, you can find any cell in O(1) time. Therefore I'd suggest first preprocessing the data from the dataframe into another data structure that grants fast access.
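For example, you could build a plain `dict` keyed by `(north, east)` once and run all neighbour lookups against it. A rough sketch, assuming (as your use of `df.loc` suggests) that the dataframe's index matches its `ids` column; `get_same_neighbours_fast` is just an illustrative name:

```python
# One-time preprocessing: map (north, east) -> class for O(1) lookups.
grid = {
    get_n_e_index(grid_id): agglo_df.loc[grid_id, "class"]
    for grid_id in agglo_df["ids"].values
}

def get_same_neighbours_fast(base_class, north, east, grid):
    # Same four-neighbour logic as in the question, but each check
    # is now a dict access instead of a scan over the dataframe.
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return [
        get_grid_id(north + i, east + j)
        for i, j in offsets
        if grid.get((north + i, east + j)) == base_class
    ]
```

With that, every neighbour check is O(1) and the whole labelling pass becomes linear in the number of cells.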