I'm looking for something that allows me to sort a list of regular expressions according to their specificity/strictness, or for some documentation and research on the topic:
/[a-z]+/ // most strict
/[a-z0-9]+/
/[a-z0-9èòà]+/
/.*/ // least strict
But what about
/[a-z]+ABC/
/[a-z0-9]+/
Which one is less specific than the other?
Thank you in advance.
One can equate a regular expression with the set of strings it matches (called a "regular language"). If our regular expression is named E, let's call its set of matching strings L(E).
Strictness in the sense you are alluding to above then becomes the subset relation: define RE A to be stricter than RE B if L(A) is a proper subset of L(B). This puts to rest ambiguities like synonyms for the "same" RE: they are the same precisely because they have the same regular language.
As @yi_H points out, the subset relation over RE languages (over some common alphabet) forms a partial ordering, not a total one. Your two examples are in fact incomparable: "aABC" is in the language of /[a-z]+ABC/ but not of /[a-z0-9]+/, while "0" is in the second language but not the first, so neither is stricter than the other. You sound like you want a total ordering. If so, you can stipulate that an acceptable total ordering should embed the partial ordering represented by the subset relation.
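Given such pairwise comparisons, any topological sort of the "stricter-than" edges produces a total order that embeds the partial order; incomparable patterns simply land in an arbitrary but consistent relative position. A sketch reusing the approx_subset helper above, so it inherits that helper's approximate nature (graphlib requires Python 3.9+):

    from graphlib import TopologicalSorter

    def strictness_order(patterns, subset=approx_subset):
        """Return the patterns sorted strictest-first, in some total order
        that respects every detected proper-subset relation."""
        ts = TopologicalSorter({p: set() for p in patterns})
        for a in patterns:
            for b in patterns:
                if a != b and subset(a, b) and not subset(b, a):
                    ts.add(b, a)  # a is stricter, so a must precede b
        return list(ts.static_order())

    print(strictness_order([r".*", r"[a-z]+", r"[a-z0-9]+"]))
    # ['[a-z]+', '[a-z0-9]+', '.*']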
I don't have a clear answer for how to build that total ordering, but two approaches come to mind.
The first is to exploit the pumping lemma. It turns out that for any RE, if it matches a sufficiently long string, then it must also match a longer string constructible from the first by repeating some subsection. You could take as your metric the length of the longest matching string that contains no such repeatable segment. Maybe that respects (embeds) the partial ordering, maybe it doesn't.
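A brute-force rendition of that idea (all names are mine; "repeatable segment" is read here as a decomposition xyz with nonempty y where both xz and xyyz still match, i.e. the pumping lemma's condition with the length bound dropped):

    import itertools
    import re

    def is_pumpable(pattern, s):
        """True if s splits as xyz, y nonempty, with xz and xyyz also matching."""
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                x, y, z = s[:i], s[i:j], s[j:]
                if re.fullmatch(pattern, x + z) and re.fullmatch(pattern, x + y + y + z):
                    return True
        return False

    def pumping_metric(pattern, alphabet="a0", max_len=8):
        """Length of the longest matching string (up to max_len) that has
        no pumpable segment; -1 if nothing matched at all."""
        best = -1
        for n in range(max_len + 1):
            for chars in itertools.product(alphabet, repeat=n):
                s = "".join(chars)
                if re.fullmatch(pattern, s) and not is_pumpable(pattern, s):
                    best = n
        return best

    print(pumping_metric(r"[a-z]+"))     # 1 -- "a" is the longest unpumpable match
    print(pumping_metric(r"[a-z0-9]+"))  # 1
    print(pumping_metric(r".*"))         # 0 -- only "" lacks a pumpable segment

At least with these parameters the metric does not embed the partial order: /[a-z]+/ and /[a-z0-9]+/ tie at 1 even though one language is a proper subset of the other.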
The other is to consider graph transformations on the RE's state machine. I suspect (but I don't have any reference) that if RE A is properly stricter than RE B, then B's automaton will be calculable from A's by collapsing states or some similar simplifying action. You could define your metric to be the number of states in the RE's smallest automaton.
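That state count can be estimated without ever constructing the automaton, via the Myhill-Nerode characterisation: two prefixes lead to the same state of the minimal DFA exactly when the same suffixes complete each of them to a match. A bounded brute-force sketch (names and cut-offs mine; the result is a lower bound on the true state count, since longer suffixes could still distinguish prefixes that look equivalent here):

    import itertools
    import re

    def approx_state_count(pattern, alphabet="a0", prefix_len=4, suffix_len=4):
        """Count distinct 'which suffixes complete me' signatures among all
        bounded-length prefixes; each distinct signature corresponds to a
        distinct state of the minimal DFA over `alphabet`."""
        suffixes = ["".join(c)
                    for n in range(suffix_len + 1)
                    for c in itertools.product(alphabet, repeat=n)]
        signatures = set()
        for n in range(prefix_len + 1):
            for chars in itertools.product(alphabet, repeat=n):
                p = "".join(chars)
                signatures.add(frozenset(s for s in suffixes
                                         if re.fullmatch(pattern, p + s)))
        return len(signatures)

    print(approx_state_count(r"[a-z]+"))  # 3: start, "seen letters", dead state
    print(approx_state_count(r".*"))      # 1: every prefix behaves identically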