Search code examples
regexstate-machineregular-languagefinite-automata

Why don't regular expression engines support all set operations?


Similar to this question on a private forum: Why don’t RegEx implementations support intersection and complement?

Finite automatas built from regular expressions are closed under the set operations union, intersection, complement, and difference. These FA are closed under the two additional operations concatenation and Kleene star.

Anecdotally, concatenation, union, and star operations are ubiquitous in regular expression implementations. Why don't regular expression engines typically support the other set operations intersection, complement, and difference?

An example FA demonstrating the intersection of the two FA substring 01 and odd number of 1s from these lecture notes.

FA construction showing closure under intersection Citation:

Scott Aaronson. 6.045J Automata, Computability, and Complexity. Spring 2011. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu. License: Creative Commons BY-NC-SA.


Solution

  • Regex engines that support lookarounds let you implement intersection, complement and difference :

    • (?=pattern1)pattern2 will match a string that both pattern1 and pattern2 match
    • (?!pattern).* will match anything that isn't matched by pattern (although more realistically you'd use pattern as a regex and have your higher-level environment reverse the match result)
    • (?!pattern1)pattern2 will match a string that is matched by pattern2 but not by pattern1

    However lookarounds are a rather recent feature in the history of regular expressions and still aren't supported by many regex engines. Why is that?
    I'm not well versed in regex's history, but if I believe a cursory glance at Wikipedia articles, they originate from mathematician Stephen Cole Kleene's definition of regular languages which is only based on the union, concatenation and Kleene star operations, which might explain why those are the basic operations in regular expressions.