Search code examples
algorithmgridgraph-algorithmbreadth-first-searchflood-fill

finding largest region with small difference more efficiently


There is an grid consisting of h * w (h, w <= 200) pixels, every pixel is represented by a value, we want to find the largest continuous region.
A continuous region is defined in this way:

  1. Given a point P(x, y), The connected region must include this point.
  1. There exists reference point R(x, y) of value v, any point in the connected region must be connected to this point. Also, there is a value g_critical(g_critical <= 100000). Let the value of a point in the connected region be v, the difference of u and v must be smaller or equal than g_critical.

The question is to find the size of the largest connected region.

For example the grid. h = 5, w = 5, g_critical = 3, P(x, y) = (2, 4)

1 3 7 9 2
2 5 6 6 8
3 5 9 3 6
2 7 3 2 9

In this case, the bold region is the largest connected region. Notice that R(x, y) is chosen at (2, 3) or (2, 2) in this case. The size of the region is 14.

I have rephrased the question a bit so it is shorter. So if there is any ambiguity, please point it out in the comment. This question is also in our private judge so I am unable to share the problem source here.

My attempt

I have tried to loop through every cell, consider it as the R point and use bfs to find the connected region attached to it. Then, check if P is contained in the region.

The complexity is O(h * h * w * w), which is too large. So any way to optimize it?

I am guessing that maybe starting with p will help, but I am not sure how I should do it. Maybe there is some kind of flood fill algorithms that allow me to do it?

Thanks in advance.


Solution

  • There's an O(h w √(g_critical) α(h w))-time algorithm (where α is the inverse Ackermann function, constant for practical purposes) that uses a disjoint set data structure with an "undo" operation and a variant of Mo's trick. The idea is, decompose the interval [v − g_critical, v] into about √g_critical subintervals of length about √g_critical. For each subinterval [a, b], prepare a disjoint set data structure representing the components of the matrix where the allowed values are [b, a + 2 g_critical]. Then for each c in [a, b], extend the disjoint set with points whose values lie in [c, b) and (a + 2 g_critical, c + 2 g_critical] and report the number of nodes in the component of P(x,y), then undo these operations (keep a stack of the writes made to the structure, with original values; then pop each one, writing the original values).

    There's also an O(h w log(h w))-time algorithm that you're not going to like because it uses dynamic trees. (Sleator–Tarjan's 1985 construction based on splay trees is the simplest and works OK here.) Posting it mainly in case it inspires a more practical approach.

    The high-level idea is a kinetic algorithm that "slides" the interval of allowed values over the at most g_critical + 1 possibilities, repeatedly reporting and taking the max over the size of the connected component containing P.

    To do this, we need to maintain the maximum spanning forest on a derived graph. Given a node-weighted undirected graph, construct a new, edge-weighted graph by subdividing each edge and setting the weight of each new edge to the weight of the old node to which it is incident. Deleting the lightest nodes in the graph is straightforward – all paths avoid these nodes if possible, so the new maximum spanning forest won't have any more edges. To add the lightest nodes not in the graph, try to link their incident edges one at a time. If a cycle would form, evert one endpoint, query the path minimum from the other endpoint, and unlink that minimum.

    To report the size of the component containing P we need another decoration that captures the size of the concrete subtree (as opposed to the represented subtree) of each node. The details get a bit gnarly.