Search code examples
stringalgorithmstring-algorithm

Efficient Approach to Decompressed String Substring Problem


The problem involves decompressing a compressed string and outputting the substring in the range of L and R. For example:

A(BB(CC)) -> ABBCCCCBBCCCC

L = 52, R = 60, ((((((((IMTOOSTRONG)))))))) -> IMTOOSTRONGIMTOOSTRONGIMTOOSTRONG...

MTOOSTRON

Here, the characters inside the parentheses are repeated, and the goal is to decompress the string and then output the substring in the given range [L, R].

The challenge is that the range [L, R] can go up to 1 billion, so decompressing the entire string would result in TLE (Time Limit Exceeded). I can’t figure out how to approach this problem efficiently.

How can I efficiently extract a substring from a nested compressed string without fully decompressing it, given that the range [L, R] can be as large as 1 billion?

I attempted to think about strategies like: Using recursion or a stack-based approach to simulate decompression. Counting the length of the decompressed string without fully expanding it, to decide which parts of the string are relevant for the [L, R] range.


Solution

  • I would suggest building a tree where the root represents the whole string, and every child represents either a substring represented by parentheses, or a literal string (without parentheses). The leaves of this tree are the literal strings.

    Each node would also store the size of the (sub)string it represents (when decompressed).

    To represent the repetition factor, you could add another node property that is either 1 or 2. Or, alternatively, you could define the same child node (reference) as a duplicated child of its parent: this creates two edges between the same node pair, so giving that relation a cardinality of 2. Note that this latter approach would turn the tree into a polytree.

    Finally define a substring method on nodes which can then use recursion to retrieve the string representations that are relevant for the given range.

    Here is an implementation in a JavaScript snippet:

    class Node {
        constructor() {
            this.length = 0; // Length of represented string (in uncompressed form)
            this.children = []; // List of strings and child nodes
        }
        addChild(child) { // child could be a Node or a string
            this.length += child.length;
            this.children.push(child);
        }
        substring(start, stop) { // `stop` is excluded from the range
            if (start >= this.length || stop <= 0 || start >= stop) return "";
            let size = stop - start;
            return this.children.map(child => {
                // If child is a string, the native substring method is called here,
                //    otherwise it is a recursive call:
                const s = child.substring(start, start + size);
                start -= child.length;
                return s;
            }).join("");
        }
    }
    
    function getTokenIterator(compressed) {
        // Return an iterator over the substrings that either 
        //    are "(", ")" or a series of other characters
        return compressed.match(/[^()]+|./gs).values();
    }
    
    function createGraph(tokens) {
        const node = new Node;
        while (true) {
            const token = tokens.next().value;
            if (!token || token === ")") return node;
            if (token === "(") {
                const child = createGraph(tokens);
                // Add child twice to reflect repetition
                node.addChild(child);
                node.addChild(child);
            } else {
                node.addChild(token);
            }
        }
    }
    
    function getSlice(compressed, left, right) {
        const tokens = getTokenIterator(compressed);
        const root = createGraph(tokens);
        // As `right` is inclusive in the range, add 1:
        return root.substring(left, right+1);
    }
    
    // Example runs
    console.log(getSlice("AA(BB(CC))", 3, 8)); // BCCCCB
    console.log(getSlice("((((((((IMTOOSTRONG))))))))", 52, 60));

    Complexity analysis

    Let 𝑛 be the length of the input string.

    The algorithm has these parts:

    1. Tokenizing: the input string is scanned from left to right, which is O(𝑛), creating the token strings, which also totals to O(𝑛).

    2. Tree building: For each token the processing has O(1) time complexity. This consist of either making a recursive call and attaching a node twice to the graph, or making a leaf node. So this totals to O(𝑛).

    3. Getting the substring: this involves traversing the polytree. Here the worst case complexity has two components:

      • It is determined by the size of the output string (i.e. π‘…βˆ’πΏ+1 assuming these are valid indices in the uncompressed string). In the worst case an input string consists of a deeply nested string (like your "strong example"), and if 𝐿=0 and 𝑅 is maximised, the whole compressed string might need to be expanded. Take for example 𝑛 as odd, and the first 𝑛/2 characters are opening parentheses, the middle character is "A", and the remaining 𝑛/2 characters are closing parentheses. Then the uncompressed string has 2𝑛/2 characters. As producing a string needs O(1) per character, the time complexity is O(2𝑛) for this scenario. If we include 𝐿 and 𝑅 in this expression, it becomes O(min(2𝑛, π‘…βˆ’πΏ)).

      • When π‘…βˆ’πΏ+1 is small, like 1, the complexity is not just O(1). There is the overhead of the polytree traversal. The traversal needs to find a leaf (for 𝐿), which in the worst case may involve visiting all internal nodes twice (because they are repeated), where each time a node's children are visited (only) upon the node's second visit. At some point we reach the leaf that is determined by 𝑅 and unwind out of recursion again. The work to go from 𝐿 and 𝑅 (while producing output) was discussed in the previous paragraph, so I omit it here.

      This means that the number of distinct nodes is a determining factor also, which is O(𝑛).

    Taking all this together the overall complexity can be expressed as:

    O(min(2𝑛, π‘…βˆ’πΏ) + 𝑛)