Search code examples
regexmachine-learningregular-language

How to learn regular expressions


I.e., I get a list of words and I want to construct a simple regular expression from that which matches at least all of the words (but maybe more).

I want to have an algorithm for that. I.e. input of that algorithm is a list of words and output is a regular expression. Obviously, there will be some restrictions. Like either the regular expression will always match more words if it should match an infinite amounts of words and I only give it a finite number of words. Or I will need some more compact representation of the input. Or I am also thinking about giving me some regular expression as input and a list of additional words and I want to get a regular expression which matches all of them together (and maybe more). In any case, it should try to construct a regular expression which is as simple as possible.

What techniques are availalbe which can do that?


I was quite misunderstood. I know the general principles behind regular expressions. I know what it is. And in most cases I can come up quite easily with a regular expression to some language by hand. But I am searching for algorithms which does that.


Again formulated a bit different:

Let L be a regular language. Let M_n be a finite subset of L with n elements. Let M_n be a subset of M_(n+1).

I want to have an algorithm LRE which gets a finite set of words and outputs a regular expression. And I want to have the property:

lim_n->infinity | diff( LRE(M_n), L ) | = 0


Solution

  • This problem has been looked at the last decade. You might want to google DFA learning, and download a couple of papers to get a sense of the state of the art.

    Once you have the DFA generating a regular expression is trivial. To avoid the problems @FrustratedWithDesign mentions some conditions such as generating the DFA with the least amount of nodes is introduced, from a machine learning point of view this is similar to having a regularization condition for the simplest hypothesis.