Search code examples
javascriptiteratorv8iterablecallstack

Maximum call stack size exceeded for large iterators


I'm trying to convert an iterator to an array. The iterator is the result of calling matchAll on a very long string. The iterator (I assume) has many matches within the string. First I tried it with the spread operator:

const array = [...myLongString.matchAll(/myregex/g)];

This gave me the error: RangeError: Maximum call stack size exceeded

So I tried iterating via next():

const safeIteratorToArray = (iterator) => {
    const result = [];
    let item = iterator.next();

    while (!item.done) {
        result.push(item.value);
        item = iterator.next();
    }

    return result;
};

But this gives me the same error, on the item = iterator.next() line. So I tried making it async in an effort to reset the call stack:

const safeIteratorToArray = async (iterator) => {
    const result = [];
    let item = iterator.next();

    while (!item.done) {
        result.push(item.value);
        item = await new Promise(resolve => setTimeout(() => resolve(iterator.next())));
    }

    return result;
};

But I still get the same error.

If you are curious about the actual use case:

The regex I'm actually using is:

 /\[(.+?)\] \[DEBUG\] \[Item (.+?)\] Success with response: ((.|\n)+?)\n\[/g

And the contents of the text file (it's a log file) generally looks like:

[TIMESTAMP] [LOG_LEVEL] [Item ITEM_ID] Success with response: {
    ...put a giant json object here
}

Repeat that ad-nauseam with newlines in between each log.


Solution

  • (V8 developer here.)

    It's not about the iterator, it's about the RegExp.

    [Update]
    Looks like I was misled by a typo in my test, so my earlier explanation/suggestion doesn't fix the problem. With the test corrected, it turns out that only the end of the expression (which I called "fishy" before) needs to be fixed.

    The massive consumption of stack memory is caused by the fact that (.|\n) is a capture group, and is matched very frequently. One idea would be to write it as [.\n] instead, but the . metacharacter is not valid inside [...] character sets.

    Hat tip to @cachius for suggesting an elegant solution: use the s flag to make . match \n characters.
    As an unrelated fix, prefer checking for the closing } instead of the next opening [ at the beginning of a line, so that there's no overlap between matched ranges (which would make you miss some matches).
    So, in summary, replace ((.|\n)+?)\n\[/g with (.+?)\n}/gs.

    [/Update]

    Here is a reproducible example. The following exhibits the stack overflow:

    let lines = ["[TIMESTAMP] [DEBUG] [Item ITEM_ID] {"];
    for (let i = 0; i < 1000000; i++) {
      lines.push("  [0]");  // Fake JSON, close enough for RegExp purposes.
    }
    lines.push("}");
    lines.push("[TIMESTAMP]");
    let myLongString = lines.join("\n");
    
    const re = /\[(.+?)\] \[DEBUG\] \[Item (.+?)\] ((.|\n)+?)\n\[/g;
    myLongString.match(re);
    

    If you replace the const re = ... line with:

    const re = /\[(.+?)\] \[DEBUG\] \[Item (.+?)\] (.+?)\n}/gs;
    

    then the stack overflow disappears.

    (It would be possible to reduce the simplified example even further, but then the connection with your original case wouldn't be as obvious.)

    [Original post below -- the mechanism I explained here is factually correct, and applying the suggested replacement indeed improves performance by 25% because it makes the RegExp simpler to match, it just isn't enough to fix the stack overflow.]
    The problematic pattern is:

    \[(.+?)\]
    

    which, after all, means "a [, then any number of arbitrary characters, then a ]". While I understand that regular expressions might seem like magic, they're actually real algorithmic work under the hood, kind of like miniature programs in their own right. In particular, any time a ] is encountered in the string, the algorithm has to decide whether to count this as one of the "arbitrary characters", or as the one ] that ends this sequence. Since it can't magically know that, it has to keep both possibilities "in mind" (=on the stack), pick one, and backtrack if that turns out to be incorrect. Since this backtracking information is kept on the stack (where else?), if you put sufficiently many ] into your string, the stack will run out of space.

    Luckily, the solution is simple: since what you actually mean is "a [, then any number of characters that aren't ], then a ]", you can just tell the RegExp engine that, replacing . with [^\]]:

    \[([^\]]+?)\]
    

    Note: ((.|\n)+?)\n\[ seems fishy for the same reason, but according to this test doesn't appear to be the problem, even if I further increase the input size. I'm not sure why; it might be due to how I created the test. If you see further problems with the real input, it may be worth reformulating this part as well.
    [/Original post]