Tags: c++, search, matrix, binary, square

Evaluating the area of the largest square in an NxN matrix that encompasses at most K ones (which are randomly placed)


I have an assignment which revolves around finding the area of the largest square that encompasses at most K ones in an NxN matrix (the rest being zeroes, N <= 2000). An amount W of ones is randomly distributed over the matrix (so K <= W can be assumed). I have solved it using a binary search approach; the crux is that the assignment also requires the problem to be solved in under 2 seconds of computation time, and mine isn't fast enough.

What I've tried: a binary search over square sizes, starting with lower bound 1 and upper bound N-1 (the case where the largest area is N*N is trivial, since it only happens when K >= W). When a square encompassing at most K ones can be found for the side length (upper + lower)/2, the bounds shift upwards, otherwise downwards. The feasibility test is probably what makes the program too slow, as it still needs O(N²) time in the worst case to check a single square size. Sadly, I'm pretty new to binary/n-ary searching and have no real idea how to make this faster. I wondered if ternary/n-ary search would help. I've also read about parallelized binary search, but I'm not sure how to implement it. Any help is greatly appreciated.

Sadly I can't provide the code right now as it is on my office computer, but I'm looking for more general ideas on how to approach the problem anyway, not any concrete implementations.
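Roughly, the idea looks like the following sketch (reconstructed from memory rather than the actual code, so the names and details are placeholders): a binary search on the side length, with a 2D prefix-sum feasibility check that costs O(N²) per candidate size.

    #include <vector>
    using std::vector;

    // prefix[i][j] = number of ones in grid[0..i-1][0..j-1]
    vector<vector<int>> buildPrefix(const vector<vector<int>>& grid) {
        int n = grid.size();
        vector<vector<int>> prefix(n + 1, vector<int>(n + 1, 0));
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                prefix[i + 1][j + 1] = grid[i][j] + prefix[i][j + 1]
                                     + prefix[i + 1][j] - prefix[i][j];
        return prefix;
    }

    // O(N^2): is there any s-by-s square containing at most K ones?
    bool feasible(const vector<vector<int>>& prefix, int n, int s, int K) {
        for (int i = s; i <= n; i++)
            for (int j = s; j <= n; j++) {
                int ones = prefix[i][j] - prefix[i - s][j]
                         - prefix[i][j - s] + prefix[i - s][j - s];
                if (ones <= K) return true;
            }
        return false;
    }

    // O(N^2 log N): binary search on the side length
    int largestSideByBinarySearch(const vector<vector<int>>& grid, int K) {
        int n = grid.size();
        auto prefix = buildPrefix(grid);
        int lo = 1, hi = n - 1, best = 0;   // side N handled separately (feasible iff K >= W)
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (feasible(prefix, n, mid, K)) { best = mid; lo = mid + 1; }
            else                             { hi = mid - 1; }
        }
        return best;                         // the area is best * best
    }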


Solution

  • As pointed out in the comments, we can't know whether an O(n^2 log n) solution can pass unless we know what hardware the online judge uses. It's possible that an O(n^2 log n) solution should be able to pass but you've written it in a way that makes the constant factor too large, for example by iterating over the matrix in the wrong way (which results in constant cache misses). So in that case it would be necessary to see your code to diagnose any performance bugs. But perhaps the judge is on some old machine where O(n^2 log n) is not supposed to pass. In that case, you can try an O(n^2) solution. I will describe one below.
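    As an illustration of the kind of constant-factor trap meant here (a generic example, not a claim about what your code does): summing an N x N row-major matrix column by column touches memory with stride N and misses the cache far more often than the row-by-row loop, even though both are O(N^2).

    #include <vector>

    // grid stored row-major as a flat std::vector<int> of size N * N
    long long sumColumnByColumn(const std::vector<int>& a, int N) {
        long long s = 0;
        for (int j = 0; j < N; j++)        // stride-N accesses: poor cache locality
            for (int i = 0; i < N; i++)
                s += a[(long long)i * N + j];
        return s;
    }

    long long sumRowByRow(const std::vector<int>& a, int N) {
        long long s = 0;
        for (int i = 0; i < N; i++)        // contiguous accesses: cache-friendly
            for (int j = 0; j < N; j++)
                s += a[(long long)i * N + j];
        return s;
    }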

    For each cell (i, j) in the N by N grid, we'll compute the largest M[i][j] such that a square of side length M[i][j] with its lower-right corner at (i, j) has at most K ones. We'll also store the number of ones in that square; call it O[i][j]. The answer to the whole problem instance is then just the maximum of M[i][j] over all (i, j) (squared, if you want the area rather than the side length).

    To compute these M[i][j] and O[i][j] values, we use the following algorithm:

    // precompute row and column partial sums
    int bestM = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            if (i == 0 || j == 0) {
                // border case: only a 1x1 square fits here, so
                // M[i][j] = 1 and O[i][j] = grid[i][j] if grid[i][j] <= K, otherwise both stay 0
            } else {
                // use the values of M[i-1][j-1] and O[i-1][j-1] to compute M[i][j] and O[i][j]
            }
            bestM = std::max(bestM, M[i][j]);
        }
    }
    return bestM;
    

    To compute M[i][j] and O[i][j] from M[i-1][j-1] and O[i-1][j-1], we proceed as follows: we know that M[i][j] is at most M[i-1][j-1] + 1, because a square of side length M[i-1][j-1] + 2 with lower-right corner at (i, j) would strictly contain the square of side length M[i-1][j-1] + 1 with lower-right corner at (i-1, j-1), and that square is known to contain more than K ones (since M[i-1][j-1] is maximal). So we have an inner loop:

    int o = /* number of ones in square of side length M[i-1][j-1] + 1 with lower right corner at (i, j) */;
    for (int k = M[i-1][j-1] + 1; k >= 0; k--) {
        if (o <= K) {
            O[i][j] = o;
            M[i][j] = k;
            break;
        } else {
            // reduce o to be the number of ones in square of side length k-1 with lower right corner at (i, j)
        }
    }
    

    To compute o efficiently, we observe that the square of side length M[i-1][j-1] + 1 with lower right corner at (i, j) is just the disjoint union of:

    1. the square of side length M[i-1][j-1] with lower right corner at (i-1, j-1), and
    2. an L-shape consisting of the bottom row and right column of the larger square, with its vertex at (i, j)

    So the initial value of o is just the sum of the number of ones in region 1, namely O[i-1][j-1], and the number of ones in region 2. We can get the latter in constant time if we have precomputed partial sums along each row and each column, as indicated in the first pseudocode block. Similarly, every time o needs to be reduced in the inner loop, we just subtract the number of ones in the L-shape formed by the top row and left column of the side-k square, which we can likewise read off from the partial sums.
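
    As a concrete sketch of that primitive (the names rowSum, colSum and onesInLowerRightL below are mine, not part of the original description): with rowSum[i][j] = ones in row i among columns [0, j) and colSum[j][i] = ones in column j among rows [0, i), the L-shape with vertex at (i, j) and arm length k costs only a few lookups; the top-and-left L used for shrinking is handled symmetrically.

    #include <vector>

    // rowSum[i][j] = ones in row i, columns [0, j); colSum[j][i] = ones in column j, rows [0, i)
    void buildPartialSums(const std::vector<std::vector<int>>& grid,
                          std::vector<std::vector<int>>& rowSum,
                          std::vector<std::vector<int>>& colSum) {
        int n = grid.size();
        rowSum.assign(n, std::vector<int>(n + 1, 0));
        colSum.assign(n, std::vector<int>(n + 1, 0));
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                rowSum[i][j + 1] = rowSum[i][j] + grid[i][j];
                colSum[j][i + 1] = colSum[j][i] + grid[i][j];
            }
    }

    // Ones in the L-shape formed by the bottom row and right column of the
    // side-k square whose lower-right corner is (i, j).
    int onesInLowerRightL(const std::vector<std::vector<int>>& rowSum,
                          const std::vector<std::vector<int>>& colSum,
                          int i, int j, int k) {
        int rowPart = rowSum[i][j + 1] - rowSum[i][j - k + 1];   // k cells of row i
        int colPart = colSum[j][i]     - colSum[j][i - k + 1];   // k-1 cells of column j
        return rowPart + colPart;
    }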

    The overall running time is O(N^2), despite the 3 nested loops: as (i, j) moves down a diagonal of the grid, the upper-left corner of the candidate square never moves backward, so the total work done by the inner loop along each diagonal is linear in the length of that diagonal, and the overall running time is therefore linear in the total number of cells.
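
    Putting all of the pieces together, a self-contained sketch of the O(N^2) algorithm could look like the following (variable and function names are mine; it is meant to illustrate the idea above, not to be a verified drop-in solution):

    #include <algorithm>
    #include <vector>
    using std::vector;

    // Largest side of a square with at most K ones; the area asked for is the return value squared.
    int largestSquareSide(const vector<vector<int>>& grid, int K) {
        int n = grid.size();
        // Partial sums: rowSum[i][j] = ones in row i, columns [0, j);
        //               colSum[j][i] = ones in column j, rows [0, i).
        vector<vector<int>> rowSum(n, vector<int>(n + 1, 0));
        vector<vector<int>> colSum(n, vector<int>(n + 1, 0));
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                rowSum[i][j + 1] = rowSum[i][j] + grid[i][j];
                colSum[j][i + 1] = colSum[j][i] + grid[i][j];
            }

        // M[i][j]: largest side of an at-most-K-ones square with lower-right corner (i, j);
        // O[i][j]: number of ones in that square.
        vector<vector<int>> M(n, vector<int>(n, 0)), O(n, vector<int>(n, 0));
        int bestM = 0;

        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == 0 || j == 0) {
                    // Border: only a 1x1 square fits.
                    if (grid[i][j] <= K) { M[i][j] = 1; O[i][j] = grid[i][j]; }
                } else {
                    // Candidate: the side-(M[i-1][j-1]+1) square at (i, j), i.e. the
                    // side-M[i-1][j-1] square at (i-1, j-1) plus a lower-right L-shape.
                    int k = M[i - 1][j - 1] + 1;
                    int o = O[i - 1][j - 1]
                          + (rowSum[i][j + 1] - rowSum[i][j - k + 1])   // bottom row, k cells
                          + (colSum[j][i]     - colSum[j][i - k + 1]);  // right column, k-1 cells
                    // Shrink by peeling off the top row and left column of the side-k square.
                    while (k > 0 && o > K) {
                        o -= (rowSum[i - k + 1][j + 1] - rowSum[i - k + 1][j - k + 1])  // top row
                           + (colSum[j - k + 1][i + 1] - colSum[j - k + 1][i - k + 2]); // left column
                        k--;
                    }
                    M[i][j] = k;
                    O[i][j] = o;
                }
                bestM = std::max(bestM, M[i][j]);
            }
        }
        return bestM;
    }

    Note that k = M[i-1][j-1] + 1 never exceeds min(i, j) + 1, so the index arithmetic above stays inside the partial-sum arrays.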