Search code examples
excelregexpowerbipowerquerym

JS Hack for Native Regex in Power Query for Excel


Background

I wish to perform regex detection (and matching?) in Power Query (M) for Excel, which lacks native support for regex. Unlike Power BI, Excel lacks the integrations with R and Python, so I cannot use them to outsource the regex operations.

Approach

Inspired by this clever hack from u/tirlibibi17, I have generalized a native solution by integrating M with JavaScript (JS) and its regex functionality. See the Solution section below for my full work.

While I have successfully tested Rgx_Test() on a single text string, it bogs down at scale, because it regenerates the Web.Page() for every string being tested. With the plural Rgx_Tests(), I aspire to a operate on a batch of many such strings, all in one fell swoop.

My "intended script" for Rgx_Tests() should improve on the "current script" for Rgx_Test(). The native Json.*() family in M facilitates the exchange of batch data with the JS script, as represented in JSON notation. For input, a list of texts in M becomes an Array of Strings in JS. For output, an Array of Booleans in JS becomes a list of logicals in M.

Problem

Unfortunately, a mysterious failure has cut short my progress on Rgx_Tests():

  • While my current script properly renders a Boolean via document.write(), my intended script fails to likewise render a String of JSON text. This String is generated by JSON.stringify() from the Array of Booleans.

  • This problem occurs only in M, and not in JS, as demonstrated (below) by the intended script in isolation. Presumably the fault lies with Web.Page(), but why should it render a Boolean yet fail to render a String?

  • No error is thrown. The current script renders the Boolean in a text block, where I can easily find it. With the intended script, this text block is simply missing.

See the Debugging section below for details.

Question

What is the origin of this mysterious failure, and how do I fix it?

I do know that a JS script, with errors in syntax or at runtime, will yield this behavior. However, that JS script (below) runs perfectly on its own. Perhaps functions like JSON.stringify() are incompatible with the version of JS that M is using?


Debugging

Current Script for Rgx_Test()

My current script works smoothly, and renders raw text for M to interpret as a logical.

I take the single text string in M, and convert it via my helper Json_Stringify()...

Json_Stringify("A1")

...into its JSON representation...

"A1"

...which I "inject" (splice) into the JS script below.

<html><body><script>
    //        ˇˇˇˇ
    var txt = "A1";
    //        ^^^^
    var ptn = "[a-z]\\d";
    var flg = "i";
    var rgx = new RegExp(ptn, flg);
    var tst = rgx.test(txt);
    document.write(tst);
</script></body></html>

It renders the result as raw text...

true

...which I drill and extract via Web.Page():

7

I then parse it via Logical.FromText()...

Logical.FromText("true")

...to yield a logical result.

true

Intended Script for Rgx_Tests()

I aspire to the script below, which succeeds in JS but fails to render JSON text for M to interpret as a list of logicals.

I take a list of texts in M, and convert it via my helper Json_Stringify()...

Json_Stringify({"A1", "2b", "c_3", "d4_"})

...into its JSON representation...

["A1","2b","c_3","d4_"]

...which I "inject" (splice) into the JS script below:

<html><body><script>
    //         ˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇ
    var txts = ["A1","2b","c_3","d4_"];
    //         ^^^^^^^^^^^^^^^^^^^^^^^
    var ptn = "[a-z]\\d";
    var flg = "i";
    var rgx = new RegExp(ptn, flg);
    function rgx_test(txt) {return rgx.test(txt)};
    var tsts = txts.map(rgx_test);
    var out = JSON.stringify(tsts);
    document.write(out);
</script></body></html>

It should render the results as (JSON) text...

[true,false,false,true]

...which I could drill and extract via Web.Page():

9

I could then parse via Json.Document()...

Json.Document("[true,false,false,true]")

...to yield a list of logical results.

{true, false, false, true}

Failure

Unfortunately, the corresponding record for this text block is entirely absent from the table output by Web.Page():

11

Intended Rgx_Tests()

This should test a list of text strings against a regex pattern, and return a list of the logical results.

let Rgx_Tests = (
    texts as list,
    pattern as text,
    optional insensitive as nullable logical
) as nullable list =>
    let
        txts = Json_Stringify(texts),
        ptn = Json_Stringify(pattern),
        flg = Json_Stringify(Rgx_Flags(false, false, insensitive)),
        
        html = "<html><body><script>
            var txts = " & txts & ";
            var ptn = " & ptn & ";
            var flg = " & flg & ";
            var rgx = new RegExp(ptn, flg);
            function rgx_test(txt) {return rgx.test(txt)};
            var tsts = txts.map(rgx_test);
            var out = JSON.stringify(tsts);
            document.write(out);
        </script></body></html>",
        
        page = Web.Page(html),
        out = page{0}[
            Data
        ]{[
            Name = "HTML"
        ]}[Children]{[
            Name = "BODY"
        ]}[Children]{[
            Kind = "Text"
        ]}[
            Text
        ],
        
        result = try Json.Document(out) otherwise null
    in
        result
in
    Rgx_Tests

Solution

Here are my functions that currently do work.

Rgx_Flags()

This helper generates the string of regex flags for JS.

let Rgx_Flags = (
    optional hasIndices as nullable logical,
    optional global as nullable logical,
    optional ignoreCase as nullable logical,
    optional multiline as nullable logical,
    optional dotAll as nullable logical,
    optional unicode as nullable logical,
    optional unicodeSets as nullable logical,
    optional sticky as nullable logical
) as text =>
    let
        MOD_SEP = "",
        MOD_NOT = "",
        MOD_FLAGS = [
            hasIndices  = "d",
            global      = "g",
            ignoreCase  = "i",
            multiline   = "m",
            dotAll      = "s",
            unicode     = "u",
            unicodeSets = "v",
            sticky      = "y"
        ],
        
        ind = if hasIndices  = true then MOD_FLAGS[hasIndices]  else MOD_NOT,
        glo = if global      = true then MOD_FLAGS[global]      else MOD_NOT,
        ign = if ignoreCase  = true then MOD_FLAGS[ignoreCase]  else MOD_NOT,
        mul = if multiline   = true then MOD_FLAGS[multiline]   else MOD_NOT,
        dot = if dotAll      = true then MOD_FLAGS[dotAll]      else MOD_NOT,
        uni = if unicode     = true then MOD_FLAGS[unicode]     else MOD_NOT,
        uns = if unicodeSets = true then MOD_FLAGS[unicodeSets] else MOD_NOT,
        sti = if sticky      = true then MOD_FLAGS[sticky]      else MOD_NOT,
        
        result = Text.Combine({ind, glo, ign, mul, dot, uni, uns, sti}, MOD_SEP)
    in
        result
in
    Rgx_Flags

Json_Stringify()

This helper converts a value into its JSON representation.

let Json_Stringify = (
    value as any,
    optional encoding as nullable number
) as nullable text =>
    Text.FromBinary(Json.FromValue(value, encoding), encoding)
in
    Json_Stringify

Rgx_Test()

This tests a single text string against a regex pattern, and returns the logical result.

let Rgx_Test = (
    text as text,
    pattern as text,
    optional insensitive as nullable logical
) as nullable logical =>
    let
        txt = Json_Stringify(text),
        ptn = Json_Stringify(pattern),
        flg = Json_Stringify(Rgx_Flags(false, false, insensitive)),
        
        html = "<html><body><script>
            var txt = " & txt & ";
            var ptn = " & ptn & ";
            var flg = " & flg & ";
            var rgx = new RegExp(ptn, flg);
            var tst = rgx.test(txt);
            document.write(tst);
        </script></body></html>",
        
        page = Web.Page(html),
        tst = page{0}[
            Data
        ]{[
            Name = "HTML"
        ]}[Children]{[
            Name = "BODY"
        ]}[Children]{[
            Kind = "Text"
        ]}[
            Text
        ],
        
        result = try Logical.From(tst) otherwise null
    in
        result
in
    Rgx_Test

Solution

  • With help here from @DavideBacci, I have found a fairly stable solution.

    • As requested in the question, it improves performance by using Web.Page() only once, rather than for every string in texts.
    • As cautioned by Davide, it avoids "high order array functions like map...[and] loops", which are likely forbidden by "the web container in PQ". It also avoids JSON.stringify(), as I myself diagnosed here.
    • Furthermore, it requires no literal separators (like '|'), which must otherwise be parsed.

    Solution

    Rgx_Tests_2()

    Using the original helpers Json_Stringify() and Rgx_Flags(), this improvement on Rgx_Tests() actually works in Power Query.

    let Rgx_Tests_2 = (
        texts as list,
        pattern as text,
        optional insensitive as nullable logical
    ) as nullable list =>
        let
            // Constants.
            JS_NEW = "#(lf)",
            HTML_TAG = "<br>",
            
            // Workflow.
            result = if List.IsEmpty(texts) then
                {}
            else
                let
                    strs = List.Transform(texts, Json_Stringify),
                    ptn = Json_Stringify(pattern),
                    flg = Json_Stringify(Rgx_Flags(null, null, insensitive)),
                    tag = Json_Stringify(HTML_TAG),
                    
                    lines = List.Transform(strs, each
                        "document.write(tag + rgx.test(" & _ & "));"
                    ),
                    code = Text.Combine(lines, JS_NEW),
                    
                    html = "<html><body><script>
                        var ptn = " & ptn & ";
                        var flg = " & flg & ";
                        var tag = " & tag & ";
                        var rgx = new RegExp(ptn, flg);
                        " & code & "
                    </script></body></html>",
                    
                    page = Web.Page(html),
                    out = page{0}[Data]{[
                        Name = "HTML"
                    ]}[Children]{[
                        Name = "BODY"
                    ]}[Children],
                    exprs = Table.SelectRows(out, each [Kind] = "Text")[Text],
                    
                    lgls = try List.Transform(exprs, Logical.FromText) otherwise null
                in
                    lgls
        in
            result
    in
        Rgx_Tests_2
    

    Script for Rgx_Tests_2()

    I take a list of texts in M, and represent them as JS literals via my helper Json_Stringify():

    List.Transform({"A1", "2b", "c_3", "d4_"}, Json_Stringify)
    

    Then I .Transform() these into lines of code, which .write() their .test() results to the page, and I .Combine() them into a single block...

    document.write(tag + rgx.test("A1"));
    document.write(tag + rgx.test("2b"));
    document.write(tag + rgx.test("c_3"));
    document.write(tag + rgx.test("d4_"));
    

    ...which I "inject" (splice) into the JS script below:

    <html><body><script>
        var ptn = "[a-z]\\d";
        var flg = "i";
        var tag = "<br>";
        var rgx = new RegExp(ptn, flg);
        
     // ˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇ
        document.write(tag + rgx.test("A1"));
        document.write(tag + rgx.test("2b"));
        document.write(tag + rgx.test("c_3"));
        document.write(tag + rgx.test("d4_"));
     // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    </script></body></html>

    It renders the results in separate blocks of plain text...

    true

    false

    false

    true

    ...which I drill and extract via Web.Page().

    11

    Finally I parse via Logical.FromText()...

    List.Transform({"true", "false", "false", "true"}, Logical.FromText)
    

    ...to yield a list of logical results.

    {true, false, false, true}