I wish to perform regex detection (and matching?) in Power Query (M) for Excel, which lacks native support for regex. Unlike Power BI, Excel lacks the integrations with R and Python, so I cannot use them to outsource the regex operations.
Inspired by this clever hack from u/tirlibibi17, I have generalized a native solution by integrating M with JavaScript (JS) and its regex functionality. See the Solution section below for my full work.
While I have successfully tested Rgx_Test()
on a single text
string, it bogs down at scale, because it regenerates the Web.Page()
for every string being tested. With the plural Rgx_Tests()
, I aspire to a operate on a batch of many such strings, all in one fell swoop.
My "intended script" for Rgx_Tests()
should improve on the "current script" for Rgx_Test()
. The native Json.*()
family in M facilitates the exchange of batch data with the JS script, as represented in JSON notation. For input, a list
of text
s in M becomes an Array
of String
s in JS. For output, an Array
of Boolean
s in JS becomes a list
of logical
s in M.
Unfortunately, a mysterious failure has cut short my progress on Rgx_Tests()
:
While my current script properly renders a Boolean
via document.write()
, my intended script fails to likewise render a String
of JSON text. This String
is generated by JSON.stringify()
from the Array
of Boolean
s.
This problem occurs only in M, and not in JS, as demonstrated (below) by the intended script in isolation. Presumably the fault lies with Web.Page()
, but why should it render a Boolean
yet fail to render a String
?
No error is thrown. The current script renders the Boolean
in a text block, where I can easily find it. With the intended script, this text block is simply missing.
See the Debugging section below for details.
What is the origin of this mysterious failure, and how do I fix it?
I do know that a JS script, with errors in syntax or at runtime, will yield this behavior. However, that JS script (below) runs perfectly on its own. Perhaps functions like JSON.stringify()
are incompatible with the version of JS that M is using?
Rgx_Test()
My current script works smoothly, and renders raw text for M to interpret as a logical
.
I take the single text
string in M, and convert it via my helper Json_Stringify()
...
Json_Stringify("A1")
...into its JSON representation...
"A1"
...which I "inject" (splice) into the JS script below.
<html><body><script>
// ˇˇˇˇ
var txt = "A1";
// ^^^^
var ptn = "[a-z]\\d";
var flg = "i";
var rgx = new RegExp(ptn, flg);
var tst = rgx.test(txt);
document.write(tst);
</script></body></html>
It renders the result as raw text...
true
...which I drill and extract via Web.Page()
:
I then parse it via Logical.FromText()
...
Logical.FromText("true")
...to yield a logical
result.
true
Rgx_Tests()
I aspire to the script below, which succeeds in JS but fails to render JSON text for M to interpret as a list
of logical
s.
I take a list
of text
s in M, and convert it via my helper Json_Stringify()
...
Json_Stringify({"A1", "2b", "c_3", "d4_"})
...into its JSON representation...
["A1","2b","c_3","d4_"]
...which I "inject" (splice) into the JS script below:
<html><body><script>
// ˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇ
var txts = ["A1","2b","c_3","d4_"];
// ^^^^^^^^^^^^^^^^^^^^^^^
var ptn = "[a-z]\\d";
var flg = "i";
var rgx = new RegExp(ptn, flg);
function rgx_test(txt) {return rgx.test(txt)};
var tsts = txts.map(rgx_test);
var out = JSON.stringify(tsts);
document.write(out);
</script></body></html>
It should render the results as (JSON) text...
[true,false,false,true]
...which I could drill and extract via Web.Page()
:
I could then parse via Json.Document()
...
Json.Document("[true,false,false,true]")
...to yield a list
of logical
results.
{true, false, false, true}
Unfortunately, the corresponding record
for this text block is entirely absent from the table
output by Web.Page()
:
Rgx_Tests()
This should test a list
of text
strings against a regex pattern
, and return a list
of the logical
results.
let Rgx_Tests = (
texts as list,
pattern as text,
optional insensitive as nullable logical
) as nullable list =>
let
txts = Json_Stringify(texts),
ptn = Json_Stringify(pattern),
flg = Json_Stringify(Rgx_Flags(false, false, insensitive)),
html = "<html><body><script>
var txts = " & txts & ";
var ptn = " & ptn & ";
var flg = " & flg & ";
var rgx = new RegExp(ptn, flg);
function rgx_test(txt) {return rgx.test(txt)};
var tsts = txts.map(rgx_test);
var out = JSON.stringify(tsts);
document.write(out);
</script></body></html>",
page = Web.Page(html),
out = page{0}[
Data
]{[
Name = "HTML"
]}[Children]{[
Name = "BODY"
]}[Children]{[
Kind = "Text"
]}[
Text
],
result = try Json.Document(out) otherwise null
in
result
in
Rgx_Tests
Here are my functions that currently do work.
Rgx_Flags()
This helper generates the string of regex flags for JS.
let Rgx_Flags = (
optional hasIndices as nullable logical,
optional global as nullable logical,
optional ignoreCase as nullable logical,
optional multiline as nullable logical,
optional dotAll as nullable logical,
optional unicode as nullable logical,
optional unicodeSets as nullable logical,
optional sticky as nullable logical
) as text =>
let
MOD_SEP = "",
MOD_NOT = "",
MOD_FLAGS = [
hasIndices = "d",
global = "g",
ignoreCase = "i",
multiline = "m",
dotAll = "s",
unicode = "u",
unicodeSets = "v",
sticky = "y"
],
ind = if hasIndices = true then MOD_FLAGS[hasIndices] else MOD_NOT,
glo = if global = true then MOD_FLAGS[global] else MOD_NOT,
ign = if ignoreCase = true then MOD_FLAGS[ignoreCase] else MOD_NOT,
mul = if multiline = true then MOD_FLAGS[multiline] else MOD_NOT,
dot = if dotAll = true then MOD_FLAGS[dotAll] else MOD_NOT,
uni = if unicode = true then MOD_FLAGS[unicode] else MOD_NOT,
uns = if unicodeSets = true then MOD_FLAGS[unicodeSets] else MOD_NOT,
sti = if sticky = true then MOD_FLAGS[sticky] else MOD_NOT,
result = Text.Combine({ind, glo, ign, mul, dot, uni, uns, sti}, MOD_SEP)
in
result
in
Rgx_Flags
Json_Stringify()
This helper converts a value
into its JSON representation.
let Json_Stringify = (
value as any,
optional encoding as nullable number
) as nullable text =>
Text.FromBinary(Json.FromValue(value, encoding), encoding)
in
Json_Stringify
Rgx_Test()
This tests a single text
string against a regex pattern
, and returns the logical
result.
let Rgx_Test = (
text as text,
pattern as text,
optional insensitive as nullable logical
) as nullable logical =>
let
txt = Json_Stringify(text),
ptn = Json_Stringify(pattern),
flg = Json_Stringify(Rgx_Flags(false, false, insensitive)),
html = "<html><body><script>
var txt = " & txt & ";
var ptn = " & ptn & ";
var flg = " & flg & ";
var rgx = new RegExp(ptn, flg);
var tst = rgx.test(txt);
document.write(tst);
</script></body></html>",
page = Web.Page(html),
tst = page{0}[
Data
]{[
Name = "HTML"
]}[Children]{[
Name = "BODY"
]}[Children]{[
Kind = "Text"
]}[
Text
],
result = try Logical.From(tst) otherwise null
in
result
in
Rgx_Test
With help here from @DavideBacci, I have found a fairly stable solution.
Web.Page()
only once, rather than for every string in texts
.JSON.stringify()
, as I myself diagnosed here.'|'
), which must otherwise be parsed.Rgx_Tests_2()
Using the original helpers Json_Stringify()
and Rgx_Flags()
, this improvement on Rgx_Tests()
actually works in Power Query.
let Rgx_Tests_2 = (
texts as list,
pattern as text,
optional insensitive as nullable logical
) as nullable list =>
let
// Constants.
JS_NEW = "#(lf)",
HTML_TAG = "<br>",
// Workflow.
result = if List.IsEmpty(texts) then
{}
else
let
strs = List.Transform(texts, Json_Stringify),
ptn = Json_Stringify(pattern),
flg = Json_Stringify(Rgx_Flags(null, null, insensitive)),
tag = Json_Stringify(HTML_TAG),
lines = List.Transform(strs, each
"document.write(tag + rgx.test(" & _ & "));"
),
code = Text.Combine(lines, JS_NEW),
html = "<html><body><script>
var ptn = " & ptn & ";
var flg = " & flg & ";
var tag = " & tag & ";
var rgx = new RegExp(ptn, flg);
" & code & "
</script></body></html>",
page = Web.Page(html),
out = page{0}[Data]{[
Name = "HTML"
]}[Children]{[
Name = "BODY"
]}[Children],
exprs = Table.SelectRows(out, each [Kind] = "Text")[Text],
lgls = try List.Transform(exprs, Logical.FromText) otherwise null
in
lgls
in
result
in
Rgx_Tests_2
Rgx_Tests_2()
I take a list
of text
s in M, and represent them as JS literals via my helper Json_Stringify()
:
List.Transform({"A1", "2b", "c_3", "d4_"}, Json_Stringify)
Then I .Transform()
these into lines of code, which .write()
their .test()
results to the page, and I .Combine()
them into a single block...
document.write(tag + rgx.test("A1"));
document.write(tag + rgx.test("2b"));
document.write(tag + rgx.test("c_3"));
document.write(tag + rgx.test("d4_"));
...which I "inject" (splice) into the JS script below:
<html><body><script>
var ptn = "[a-z]\\d";
var flg = "i";
var tag = "<br>";
var rgx = new RegExp(ptn, flg);
// ˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇ
document.write(tag + rgx.test("A1"));
document.write(tag + rgx.test("2b"));
document.write(tag + rgx.test("c_3"));
document.write(tag + rgx.test("d4_"));
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
</script></body></html>
It renders the results in separate blocks of plain text...
true
false
false
true
...which I drill and extract via Web.Page()
.
Finally I parse via Logical.FromText()
...
List.Transform({"true", "false", "false", "true"}, Logical.FromText)
...to yield a list
of logical
results.
{true, false, false, true}