Search code examples
regexdfanfa

How to implement a NFA or DFA based regexp matching algorithm to find all matches?


The matches can be overlapped.

But if multiple matches are found starting from a same position, pick the short one.

For example, to find regexp parttern "a.*d" in a string "abcabdcd", the answer should be {"abcabd", "abd"}. And "abcabdcd" and "abdcd" should not be included.


Solution

  • Most RE engines only match an RE once and greedily by default, and standard iteration strategies built around them tend to restart the search after the end of the previous match. To do other than that requires some extra trickery. (This code is Tcl, but you should be able to replicate it in many other languages.)

    proc matchAllOverlapping {RE string} {
        set matches {}
        set nonGreedyRE "(?:${RE}){1,1}?"
        set idx 0
        while {[regexp -indices -start $idx $nonGreedyRE $string matchRange]} {
            lappend matches [string range $string {*}$matchRange]
            set idx [expr { [lindex $matchRange 0] + 1 }]
        }
        return $matches
    }
    puts [matchAllOverlapping a.*d abcabdcd]