I have a set of strings like this:
{"01", "001", 000111", 01111"}
Now, I wonder if I can ask Microsoft Z3 to find the minimal Regular Expression that satisfies the strings. For example, the output for this particular set should be something like 0*1*
.
In theory, yes. In practice, yes, but probably not in the way you wanted. At least, not yet.
Strictly speaking, it's easy to code up what you want. Here's how you'd code the problem for the set you wanted:
(set-logic QF_S)
(set-option :produce-models true)
(define-fun re1 () RegLan (str.to_re "01"))
(define-fun re2 () RegLan (str.to_re "001"))
(define-fun re3 () RegLan (str.to_re "000111"))
(define-fun re4 () RegLan (str.to_re "01111"))
(define-fun re5 () RegLan (re.union re1 re2 re3 re4))
(check-sat)
(get-value (re5))
And indeed, z3 says:
sat
((re5 (re.union (str.to_re "01")
(re.union (str.to_re "001")
(re.union (str.to_re "000111") (str.to_re "01111"))))))
But I can hear you groaning and saying "but that's not what I want!" And indeed, while this is a correct solution, it's probably not what you wanted. It is by no means "minimal" in any satisfying sense.
To really do what you need, you want to be able to write down variables that can stand for regular-expressions themselves. And indeed SMTLib, and thus z3, allow regular-expressions to be symbolic. See here for details: https://smtlib.cs.uiowa.edu/theories-UnicodeStrings.shtml. However, here's a direct quote from that page regarding what sorts of regular-expressions are allowed:
Function str.to_re allows one to write symbolic regular expressions, e.g., RegLan terms with subterms like (str.to_re x) where x is a variable. Such terms have more expressive power than regular expressions. This is intentional, for future developments. The restriction to actual regular expressions will be imposed in a logic where str.to_re will be applicable to string constants only.
And so far as I know there's no SMT-solver out there, including z3, that support symbolic regular expressions yet. Here's a simple experiment:
(set-logic QF_S)
(declare-fun x () RegLan)
(assert (str.in_re "0" x))
(check-sat)
When I feed this to z3, I get:
unknown
i.e., it's a valid problem but z3 can't handle it. If you feed it to cvc4, it's more vocal about the problem:
(error "Regular expression variables are not supported.")
So, to sum up: What you're trying to do is possible in SMTLib, but it isn't currently supported by any solver, at least not in the way you'd want it to work. Hopefully this might change in the future, though I wouldn't hold my breath.
Given this, what can you do? I think using an SMT solver for this is a bit of an overkill. For each constant string you can construct the regular expression that recognizes it. Turn that into a DFA. And then take the union of all those DFAs. Finally, turn it into a minimal DFA, and read off the regular-expression that comes off of it. This is quite a bit of work, but should work well in practice, i.e., all the parts of this strategy is well-known how to implement, and should be fairly cheap in terms of run-time complexity.