I have a regex:
.*?
(rule1|rule2)
(?:(rule1|rule2)|[^}])*
(It's designed to parse CSS files, and the 'rules' are generated by JS.)
When I try this in IE, all works as it should. Ditto when I try it in RegexBuddy or The Regex Coach.
But when I try it in Firefox or Chrome, the results are missing values.
Can anyone please explain what the real browsers are thinking, or how I can achieve results similar to IE's?
To see this in action, load up a page that gives you interactive testing, such as the W3Schools try-it-out editor (http://www.w3schools.com/jsref/tryit.asp?filename=tryjsref_regexp_exec).
Here's the source that can be pasted in:
<html>
<body>
<script type="text/javascript">
var str="#rot { rule1; rule2; }";
var patt=/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/i;
var result=patt.exec(str);
for(var i = 0; i < 3; i++) document.write(i+": " + result[i]+"<br>");
</script>
</body>
</html>
Here is the output in IE:
0: #rot { rule1; rule2;
1: rule1
2: rule2
Here is the output in Firefox and Chrome:
0: #rot { rule1; rule2;
1: rule1
2: undefined
When I try the same using string.match, I get back an array of undefined in all browsers, including IE.
var str="#rot { rule2; rule1; rule2; }";
var patt=/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/gi;
var result=str.match(patt);
for(var i = 0; i < 5; i++) document.write(i+": "+result[i]+"<br>");
As far as I can tell, the issue is the last pair of non-capturing parentheses.
When I remove them, the results are consistent across browsers, and match() gets results.
However, it does capture from the last parenthesis, in all browsers, in the following example:
<script>
var str="#rot { rule1; rule2 }";
var patt=/.*?(rule1|rule2)(?:(rule1 |rule2 )|[^}])*/gi;
var result=patt.exec(str);
for(var i =0; i < 3; i++) document.write(i+": "+result[i]+"<br>");
</script>
Notice that I've added a space to the alternatives in the second capturing group.
The same applies if I add any negated character class to the alternatives in the second capturing group:
var patt=/.*?(rule1|rule2)(?:(rule1[^1]|rule2[^1])|[^}])*/gi;
What the expletive is going on?!
Every other string I've tried behaves like the first example: the second group comes back undefined.
Any help is greatly appreciated!
EDIT:
The code has been shortened, and many hours of research put in, on Mathew's advice.
The title has been changed to make the thread easier to find.
I have marked Mathew's answer as correct, as it is well researched and described.
My answer below (written before Mathew revised his) states the logic in simpler and more direct terms.
IE is wrong. In ECMAScript, exactly one alternative can result in a string. All the others have to be undefined (not "" or anything else).
So for your alternatives, including (transform[^-][^;}]+)|(transform-origin[^;}]+), Firefox and Chrome are correct in setting the failed capture to undefined.
There's an example in the ECMAScript 5 standard (§15.10.2.3) specifically about this:
NOTE The | regular expression operator separates two alternatives. The pattern first tries to match the left Alternative (followed by the sequel of the regular expression); if it fails, it tries to match the right Disjunction (followed by the sequel of the regular expression). If the left Alternative, the right Disjunction, and the sequel all have choice points, all choices in the sequel are tried before moving on to the next choice in the left Alternative. If choices in the left Alternative are exhausted, the right Disjunction is tried instead of the left Alternative. Any capturing parentheses inside a portion of the pattern skipped by | produce undefined values instead of Strings.
Thus, for example, /a|ab/.exec("abc") returns the result "a" and not "ab". Moreover, /((a)|(ab))((c)|(bc))/.exec("abc") returns the array ["abc", "a", "a", undefined, "bc", undefined, "bc"] and not ["abc", "ab", undefined, "ab", "c", "c", undefined]
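The two examples in that note can be run directly. This is a minimal check, assuming a spec-compliant engine (for example a current Firefox, Chrome, or Node.js) and using console.log instead of document.write; the variable names are mine, not from the spec.

```javascript
// The two examples quoted from the spec note, run unchanged.
const simple = /a|ab/.exec("abc");
const nested = /((a)|(ab))((c)|(bc))/.exec("abc");

console.log(simple[0]);          // "a": the left alternative wins, even though "ab" is longer
console.log(Array.from(nested)); // ["abc", "a", "a", undefined, "bc", undefined, "bc"]

// Groups 3 and 5 sit inside alternatives that were skipped, so they are undefined.
console.log(nested[3] === undefined, nested[5] === undefined); // true true
```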
EDIT: I figured the last part out. This applies to the original as well as the simplified version. In both cases, rule1 and rule2 can't match the ; (in the original because ; is excluded by the negated character class [^;}]). Thus, when a ; is hit between declarations, the alternation chooses [^}], and it must set the last two captures to undefined.
For the * to be fully greedy, the final ; and space in the input must also be matched. For the last two * repetitions (';' and ' '), the alternation again chooses [^}], so the captures should be set to undefined at the end too.
IE fails to do this in both cases, so they stay equal to "rule1" and "rule2".
Finally, the reason that the second example behaves differently is that (transform-origin[^;}]+) matches on the very last * repetition, since there's no ; before the end.
EDIT 2: I'll walk through what should be happening in both current examples. match is the match array.
var str="#rot { rule1; rule2; }";
var patt=/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/i;
.*? - "#rot { "
(rule1|rule2) - "rule1"
match[1] = "rule1"
Star 1
[^}] - ";"
match[2] = undefined
Star 2
[^}] - " "
match[2] = undefined
Star 3
(rule1|rule2) - "rule2"
match[2] = "rule2"
Star 4
[^}] - ";"
match[2] = undefined
Star 5
[^}] - " "
match[2] = undefined
Again, IE isn't setting match[2] to undefined.
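The walkthrough above comes down to one rule: a capture group inside a repeated group reports only what happened on the last repetition. As far as I can tell, this matches the spec's description of repeated atoms, where the captures inside the atom are cleared on each repetition. Here is a small sketch, assuming a spec-compliant engine; the variable names are only for illustration.

```javascript
// Same pattern, two inputs: only the LAST repetition decides group 1.
const lastIsB = /(?:(a)|b)*/.exec("ab"); // last repetition took the `b` branch
const lastIsA = /(?:(a)|b)*/.exec("ba"); // last repetition took the `(a)` branch

console.log(lastIsB[1]); // undefined
console.log(lastIsA[1]); // "a"

// The first example from the question follows the same rule:
// the last repetition consumes " " via [^}], so group 2 ends up undefined.
const first = /.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/i.exec("#rot { rule1; rule2; }");
console.log(first[0]); // "#rot { rule1; rule2; "
console.log(first[1]); // "rule1"
console.log(first[2]); // undefined
```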
For the str.match example, you're using the global flag. That means it returns an array of whole matches, without captures. This applies to any use of String.match with the g flag. If you use g, you have to use exec to get captures.
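To make the difference concrete, here is a sketch using the string from the question. The collectRules helper and its simpler pattern are my own illustration, not something from the question; it assumes you only need the list of rule names and it scans the whole text rather than one CSS block at a time.

```javascript
const css = "#rot { rule2; rule1; rule2; }";

// With `g`, match returns whole matches only; indexes 1..4 simply don't exist,
// which is why the loop in the question prints undefined for them.
const whole = css.match(/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/gi);
console.log(whole); // ["#rot { rule2; rule1; rule2; "]

// Hypothetical helper: call exec in a loop with a global regex so that every
// occurrence is captured once, instead of capturing inside a `*` repetition.
function collectRules(text) {
  const rulePattern = /(rule1|rule2)/gi; // fresh regex per call, so lastIndex starts at 0
  const found = [];
  let m;
  while ((m = rulePattern.exec(text)) !== null) {
    found.push(m[1]); // captures are available because exec is used
  }
  return found;
}

console.log(collectRules(css)); // ["rule2", "rule1", "rule2"]
```

Because each rule is captured by its own exec call, the result does not depend on how an engine treats captures inside a repeated group.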
var str="#rot { rule1; rule2 }";
var patt=/.*?(rule1|rule2)(?:(rule1 |rule2 )|[^}])*/gi;
.*? - "#rot { "
(rule1|rule2) - "rule1"
match[1] = "rule1"
Star 1
[^}] - ";"
match[2] = undefined
Star 2
[^}] - " "
match[2] = undefined
Star 3
(rule1 |rule2 ) - "rule2 "
match[2] = "rule2 "
Since this is the last * repetition, the capture never gets set to undefined.
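This second walkthrough can be checked the same way. A minimal sketch, assuming a spec-compliant engine and console.log in place of document.write:

```javascript
// Second example from the question: the alternatives in group 2 include a trailing space.
const second = /.*?(rule1|rule2)(?:(rule1 |rule2 )|[^}])*/gi.exec("#rot { rule1; rule2 }");

console.log(second[0]); // "#rot { rule1; rule2 "
console.log(second[1]); // "rule1"
// The last repetition matched "rule2 " through group 2, so the capture survives.
console.log(JSON.stringify(second[2])); // "\"rule2 \"" (note the trailing space)
```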