javascript regex

How can I obtain just the replacement text from a template I would pass to `String.prototype.replace`?


I'm working on a project to allow users to perform dynamic regular expression replacements of strings. The primary "mode of operation" is intended to be as follows:

  • User enters a search string which includes capture groups, e.g. (foo)
  • User enters a replacement string which includes capture-group references ($1, $2, ...)
  • The final output string follows the format of the replacement string exactly (somewhat comparable to printf string formatting in C), as opposed to the format of the input string.

So, for example, the following parameters/output would be expected:

What?            Value
Input            abc def ghi jkl mno
Search           (\S+) ghi (\S+)
Replacement (*)  $1 - $2
Output           def - jkl

(*) It seems the term "replacement" leads to confusion for some. To clarify, this is NOT quite the same as a replacement pattern such as one used with e.g. x.replace(), but more like a formatting / output pattern, or, more verbosely: a pattern where you build an entirely new string from scratch, using references to substrings you've captured with the search pattern, ignoring whatever other text is in the original input string. In other words, it's not an inline replacement, but a full replacement.

The idea behind this is that the user would only have to enter the bare minimum pattern to capture the exact parts they want, rather than (as is often the case) adding parts to the search pattern that are simply omitted in the replacement.

In the example above, simply adding .* to the beginning and end of the Search parameter and using String.prototype.replace() would essentially accomplish this. However, manipulating the original user-supplied search string does not seem like a good idea: it is prone to a lot of gotchas, since regex patterns can be quite complex.
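To illustrate the kind of gotcha meant here, a minimal sketch using the example parameters from the table above (the variable names are only for illustration). Even in this simple case the padding needs care: a greedy leading .* eats into the first capture group, so the leading padding has to be lazy.

```javascript
const input = 'abc def ghi jkl mno';
const template = '$1 - $2';

// Greedy padding: the leading .* grabs as much as it can, leaving only
// the last character of "def" for the first capture group.
const greedy = input.replace(/.*(\S+) ghi (\S+).*/, template);

// Lazy leading padding: .*? gives up characters as early as possible,
// so the first capture group gets the whole word.
const lazy = input.replace(/.*?(\S+) ghi (\S+).*/, template);

console.log(greedy); // f - jkl
console.log(lazy);   // def - jkl
```

Beyond greediness, the padded pattern also changes meaning if the input contains line breaks (. does not match them without the s flag) or if the user's pattern contains a top-level alternation, which is why wrapping the user's pattern blindly is fragile.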

So, I was wondering if there are any internal utility functions/methods readily available in JavaScript to safely perform this kind of regex string transformation? I looked around for a while, but maybe I'm just not using the right keywords... Even if there isn't an elegant approach primarily using core JavaScript functions, other (safe!) approaches are still welcome of course.


Solution

  • The algorithm you are (apparently) looking for is named GetSubstitution in the ECMAScript specification, and as of the 13th edition (ECMAScript 2022) it is not exposed directly to user code. As of writing this answer, there is not even a stage-zero proposal to expose it. This means that any solution to your problem will necessarily involve one of: non-portable APIs (I am not aware of any), reimplementing GetSubstitution yourself (as in @jav974’s answer), or an abstraction inversion anti-pattern like the following:

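    const getSubstitution = (haystack, needle, replacement) => {
        // Parentheses are required: `!needle instanceof RegExp` parses as
        // `(!needle) instanceof RegExp`, which is always false.
        if (!(needle instanceof RegExp))
            throw new TypeError('needle must be a RegExp');
        if (needle.global)
            throw new TypeError('global flag must be disabled');
        // Sticky copy of the pattern, so that .replace() can be anchored
        // at the position where the match was already found.
        const needleSticky = new RegExp(
            needle.source, needle.flags.replace('y', '') + 'y');
        const m = needle.exec(haystack);
        if (!m)
            return null;
        needleSticky.lastIndex = m.index;
        // r0 is: text before the match + substitution + text after the match.
        const r0 = haystack.replace(needleSticky, replacement);
        // Cut off the untouched prefix and suffix, keeping the substitution.
        return r0.substring(
            m.index, r0.length - (haystack.length - (m.index + m[0].length)));
    };

    // Logs: ([b4=abc ] def - jkl def [af= mno 123fff45678])
    console.log(getSubstitution(
        'abc def ghi jkl mno 123fff45678',
        /(?<ah>\S+) ghi (\S+)/u, '([b4=$`] $1 - $2 $<ah> [af=$\'])'));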

    A version that handles .global patterns is even more involved:

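    const getSubstitutionsGlobal = function* (haystack, needle, replacement) {
        // Parentheses are required: `!needle instanceof RegExp` parses as
        // `(!needle) instanceof RegExp`, which is always false.
        if (!(needle instanceof RegExp))
            throw new TypeError('needle must be a RegExp');
        // Non-global, sticky copy used to redo each match at a known position.
        const needleSticky = new RegExp(
            needle.source, needle.flags.replace(/[gy]/g, '') + 'y');
        // matchAll starts from needle.lastIndex, so reset it first.
        needle.lastIndex = 0;
        for (const m of haystack.matchAll(needle)) {
            needleSticky.lastIndex = m.index;
            const r0 = haystack.replace(needleSticky, replacement);
            // Keep only the substitution, dropping the surrounding text.
            yield r0.substring(
                m.index, r0.length - (haystack.length - (m.index + m[0].length)));
        }
    };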
    
    console.log(Array.from(getSubstitutionsGlobal(
        'abc def ghi jkl mno 123fff45678 abc 123 ghi 456 mno',
        /(?<ah>\S+) ghi (\S+)/ug, "([b4=$`] $1 - $2 [ah=$<ah>] [af=$'])")));

    This is inefficient in two ways. First, the pattern is matched against the string twice: once in RegExp.prototype.exec (or matchAll in the global version) and again in RegExp.prototype[Symbol.replace]. Using the sticky variant for the latter ameliorates this somewhat, since at least the position of the match is not searched for twice. Second, the .replace method concatenates the text surrounding the match only for it to be discarded later. Expect this solution to perform poorly with long input strings.

    Reimplementing GetSubstitution yourself does not run into those problems. Given how strongly backwards-compatible JavaScript engines tend to be, you should expect it to diverge from a native implementation only if a new edition of the language introduces a new flag or gives meaning to previously invalid RegExp syntax (as happened with named groups), since that is the only way the semantics of replacement interpolation could change. If that happens, there is no telling whether the above solution will keep working either. As such, I find @jav974’s answer preferable myself.
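
For a rough idea of what such a reimplementation involves, here is a minimal sketch. It is not @jav974’s code (which is not reproduced here), and the names expandTemplate, TOKEN and TOKEN_NO_NAMES are hypothetical. It assumes the match object comes from RegExp.prototype.exec or String.prototype.matchAll, and it tries to follow the replacement patterns as I understand them ($$, $&, $`, $', $n, $nn, $<name>), including the fallback where an out-of-range two-digit reference is read as a one-digit reference followed by a literal digit.

```javascript
// Replacement tokens. $<name> is only special when the pattern has named
// groups, so a second token pattern without that alternative is kept too.
const TOKEN = /\$(?:(\$)|(&)|(`)|(')|(\d\d?)|<([^>]*)>)/g;
const TOKEN_NO_NAMES = /\$(?:(\$)|(&)|(`)|(')|(\d\d?))/g;

// Hypothetical helper: expands `template` for a single match object.
const expandTemplate = (match, template) => {
    const input = match.input;
    const end = match.index + match[0].length;
    const groupCount = match.length - 1;
    const tokens = match.groups === undefined ? TOKEN_NO_NAMES : TOKEN;
    return template.replace(
        tokens,
        (token, dollar, whole, before, after, digits, name) => {
            if (dollar) return '$';                          // $$
            if (whole) return match[0];                      // $&
            if (before) return input.slice(0, match.index);  // $`
            if (after) return input.slice(end);              // $'
            if (digits !== undefined) {                      // $n or $nn
                let n = Number(digits);
                let rest = '';
                // An out-of-range two-digit reference falls back to a
                // one-digit reference followed by a literal digit.
                if (digits.length === 2 && (n < 1 || n > groupCount)) {
                    n = Number(digits[0]);
                    rest = digits[1];
                }
                // Still out of range: leave the token untouched.
                if (n < 1 || n > groupCount) return token;
                // Groups that did not participate expand to ''.
                return (match[n] ?? '') + rest;
            }
            // $<name>; only reachable when the TOKEN pattern is in use.
            return match.groups[name] ?? '';
        });
};

const demoMatch = /(\S+) ghi (\S+)/.exec('abc def ghi jkl mno');
console.log(demoMatch && expandTemplate(demoMatch, '$1 - $2')); // def - jkl
```

Because the function passed to .replace returns plain strings, captured text containing $ is not interpreted a second time. A sketch like this should be checked against String.prototype.replace on the engines you target before relying on it.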