Search code examples
phpregexalgorithmlanguage-agnosticdfa

Create set of all possible matches for a given regex


I'm wondering how to find a set of all matches to a given regex with a finite number of matches.

For example:

All of these example you can assume they start with ^ and end with $

`hello?` -> (hell, hello)
`[1-9][0-9]{0,3}` -> (1,2,3 ..., 9998, 9999)
`My (cat|dog) is awesome!` -> (My cat is awesome!, My dog is awesome!)
`1{1,10}` -> (1,11, ..., 111111111, 1111111111)
`1*` -> //error
`1+` -> //error
`(1|11){2}` -> (1,11,111,1111) //notice how it doesn't repeat any of the possibilities

I'd also be interested if there was a way of retrieving count a unique a solutions to the regex or if there is a way to determine if the regex has a finite solutions.

It would be nice if the algorithm could parse any regex, but a powerful enough subset of the regex would be fine.

I'm interested in a PHP solution to this problem, but other languages would also be fine.

EDIT:

I've learned in my Formal Theory class about DFA which can be used to implement regex (and other regular languages). If I could transform the regex into a DFA the solution seems fairly straight forward to me, but that transformation seems rather tricky to me.

EDIT 2:

Thanks for all the suggestions, see my post about the public github project I'm working on to "answer" this question.


Solution

  • I have begun working on a solution on Github. It can already lex most examples and give the solution set for finite regex.

    It currently passes the following unit tests.

    <?php
    
    class RegexCompiler_Tests_MatchTest extends PHPUnit_Framework_TestCase
    {
    
        function dataProviderForTestSimpleRead()
        {
            return array(
                array( "^ab$", array( "ab" ) ),
                array( "^(ab)$", array( "ab" ) ),
                array( "^(ab|ba)$", array( "ab", "ba" ) ),
                array( "^(ab|(b|c)a)$", array( "ab", "ba", "ca" ) ),
                array( "^(ab|ba){0,2}$", array( "", "ab", "ba", "abab", "abba", "baab", "baba" ) ),
                array( "^(ab|ba){1,2}$", array( "ab", "ba", "abab", "abba", "baab", "baba" ) ),
                array( "^(ab|ba){2}$", array( "abab", "abba", "baab", "baba" ) ),
                array( "^hello?$", array( "hell", "hello" ) ),
                array( "^(0|1){3}$", array( "000", "001", "010", "011", "100", "101", "110", "111" ) ),
                array( "^[1-9][0-9]{0,1}$", array_map( function( $input ) { return (string)$input; }, range( 1, 99 ) ) ),
                array( '^\n$', array( "\n" ) ),
                array( '^\r$', array( "\r" ) ),
                array( '^\t$', array( "\t" ) ),
                array( '^[\\\\\\]a\\-]$', array( "\\", "]", "a", "-" ) ), //the regex is actually '^[\\\]a\-]$' after PHP string parsing
                array( '^[\\n-\\r]$', array( chr( 10 ), chr( 11 ), chr( 12 ), chr( 13 ) ) ),
            );
        }
    
        /**
         * @dataProvider dataProviderForTestSimpleRead
         */
    
        function testSimpleRead( $regex_string, $expected_matches_array )
        {
            $lexer = new RegexCompiler_Lexer();
            $actualy_matches_array = $lexer->lex( $regex_string )->getMatches();
            sort( $actualy_matches_array );
            sort( $expected_matches_array );
            $this->assertSame( $expected_matches_array, $actualy_matches_array );
        }
    
    }
    
    ?>
    

    I would like to build an MatchIterator class that could handle infinite lists as well as one that would randomly generate matches from the regex. I'd also like to look into building regex from a match set as a way of optimizing lookups or compressing data.