regex string algorithm pattern-matching trie

how to determine the sub string in a regular expression?

I have implemented a trie for pattern searching and that is working fine. Using this trie I can find all the keywords present in text in O(n) complexity.

Problem is I want to use regex for my patterns(keywords) and want to find all the keywords present in the text.

Example: I write [a-z0-9\.]{6, 30}\@[a-z0-9\.]{2,12}\.[a-z0-9]{2,6} to find email id and it will fetch me the right thing but it will not find the sub-strings lying under first or second block.

For example i have the text as. examplegmail@gmail.com

and keywords are : ample mail

In this example this regex will tell me the email id's ending position but it will not tell anything about ample or mail keyword.

EDIT: suppose I have regex as a*(b|cd?)+ and the DFA would look like::

enter image description here

and now I have data like dfdfdacbcbbcb in this data it'll tell me the patterns after reaching at ac and so on at every character but how shall I get to know the length of ending patterns????

Solution

Your "trie" contains operations: "test for char" "branch to nth subtree".

Add another operator to save positions: "remember Nth character index" which writes the current character position the trie is inspecting into the nth slot of an array of pointers into the string.

Insert these operators in your (abstract) trie specification, compile to the real trie and then run it. As the trie matcher "crosses" various critical points in your match, it can record those points in the string buffer. At a final match, you have an array of pointers (as many as you like) to the subparts of your match.

For your example:

[a-z0-9\.]{6, 30}\@[a-z0-9\.]{2,12}\.[a-z0-9]{2,6}

imagine I want to pick the text to the left and right of the the @.

I add position saving operators, which I have arbitrarily denoted as "#n":

#1[a-z0-9\.]{6, 30}#2\@[a-z0-9\.]{2,12}\.[a-z0-9]{2,6}#3

This will (rather trivially) capture the starting position, the position of the "@" sign, and (rather trivially) the end position, as positions 1, 2 and 3. Of course, you can more more in the middle, as you see fit.

[Many regex systems implicitly do this where they encounter grouping operators (...), number the groupings from left to right. That can always be enough, because you can always wrap an interesting sub-regex in such a grouping operator. I like the explicit indication scheme; it is clear to the reader, and the pattern matcher, where it must insert these position-capture operations. We have implemented regex matchers using exactly the #n notation above.].

If you are looking for a wide variety of keywords and related text, your trie likely has a lot of choice operators in it. You can add these position capture operators in appropriate places in each choice branch to pick out information relevant to the keyword. You may need to add another operator, "recognized keyword k", to help the code that interprets the pattern matcher result understand which special keywords got found, and therefore how to interpret the position indexes.