string algorithm parentheses cumulative-sum fenwick-tree

Length of longest chain of balanced parentheses in given ranges

First, define a string of "balanced" parentheses as a string such that for every '(' there is one unique, matching ')' somewhere after that '('.

For example, the following strings are all "balanced":

()

()()

(())()

while these are not:

)(

())(

Given a string of parentheses (length <= 1,000,000) and a list of range queries, find the maximum length of balanced parentheses within each of the ranges for each of the <= 100,000 queries (using 0-indexing for the ranges)

Ex:

()))()(())

Range: [0,3] -> Longest = 2 : "()"

Range: [0, 4] -> Longest = 2 : "()"

Range: [5, 9] -> Longest = 4: "(())"

My thoughts are as follows:

First, just determining whether a string is "balanced" can be done by maintaining a stack. If you encounter a '(', push into the stack and when you encounter a ')', pop out of the stack. If at the end any '(' remains, then the string is not "balanced."

However, repeating this for all the ranges is O(N*M) complexity which is too long for the size of the inputs.

Now, upon noticing the range queries, prefix sums and binary indexed trees/segment trees come to mind. If you can precompute the entire prefix sum range into an array, then you can find smaller prefix sums by taking the difference, which will have good big-o complexity.

I had the idea of assigning a +1 value to a '(' and a -1 value to a ')' so that way every time you encounter a '(' you add one to the cumulative sum and when you encounter a ')' you decrease. So for a valid, "balanced" string like ))() you would get: -1 -2 -1 -2.

However, my question is how do you use this to determine if it is "balanced"? Also, because you need to find the largest "balanced" string within a given interval, how do you use the ability to check if a given substring is "balanced" to find this largest within that given interval.

Solution

Introduction

For every starting bracket at position x, you want to find the matching closing bracket at position y, such that the substring from x to y, S[x, y], is the maximum substring that is balanced. We are not interested in substrings starting at closing brackets, because strings starting there are at most as good as strings starting at the first subsequent opening bracket.

The most important observation is the following: for every opening bracket starting at some position x' for x < x' < y, the matching closing bracket is at y' where x' < y' < y. This is because every prefix of S[x, y] contains at least as many opening brackets as closing brackets, so S[x', y] contains at most as many opening as closing brackets.

We can use this knowledge to build a tree, where each node represents a maximum balanced substring, so it has a starting position and an end position. The children of this node represent the balanced substrings of this top-level substring. Since there may be closing brackets that do not match an opening bracket, there is no single root, so we actually have a forest¹.

A picture says more than a thousand words. Consider this example:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
) ) ( ( ) ( ( ) ) )  )  (  (  )  )

This would give the following tree:

   (2, 9)       (11, 14)
   /     \          |
(3, 4) (5, 8)   (12, 13)
          |
       (6, 7)

Building the tree

Building the tree is really easy. Start with an empty stack. Whenever you encounter an opening bracket at position x:

create a node (x, ..)
add that node as a child of the node currently on top of the stack (if it exists, otherwise this is a root)
push the new node on the stack

Whenever you encounter a closing bracket at position y:

pop the node (x, ..) that is on top of the stack (if there is no such node, this is an unmatched closing bracket: ignore)
set the closing index: (x, y)

You scan the string once and perform a constant number of operations in each step, so building the structure is done in linear time.

Querying the tree

Now you need to run your queries. Given a query (p, q), you need to find the largest node (where size is y - x + 1) that is fully contained in (p, q). Simply do a two binary searches to find the start and end position in your tree. (If the character at position p is a closing bracket you look for the smallest x > p. Similarly, if the character at position q is an opening bracket you look for the largest y < p.) Find the largest interval along the paths to x and y.

If your tree is nicely balanced, each query takes O(lg n) time. Worst case, the string starts with all opening brackets and ends with all closing brackets. Then the query may take up to linear time.

¹_{We could remedy this by adding as many opening brackets to the front of the string as there are unmatched closing brackets.}