I have to create a regular expression to identify emails. Here is how it looks so far:
[A-Za-z0-9]+([._-]*[A-Za-z0-9]+)*[@]+[A-Za-z0-9]+([._-]*[A-Za-z0-9]+)*(.com)*
What I want this regex to do is identify an email. The catch is that the email must not start or end with a non-alphanumeric character. So:
.ilikestack@gmail.com or ilikestack@gmail.com_ = invalid
ilike.stack@gmail = valid
But when I run my Lex program, the first two emails above are considered valid, and I can't figure out how to change this.
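The usual way to control what can and can't appear before and after a regex is to define another regex, or multiple ones, which match the same thing but surrounded by invalid characters. Because Lex always picks the rule with the longest match, these extra rules win whenever an invalid character is present.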
So if we had the regex [a-z]+, but we only wanted it to match if it was preceded only by white space (or the beginning of the file) and followed only by white space or a dot (or the end of the file), we could accomplish that as follows:
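[a-z]+ printf("Successful match: '%s'!\n", yytext);
[^a-z \t\r\n][a-z]+ ; /* preceded by an invalid character: ignore */
[a-z]+[^a-z \t\r\n.] ; /* followed by an invalid character: ignore */
. ; /* skip any other single character */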
Then the input ab cd_ ef. .de fg
would produce the output:
Successful match: 'ab'!
Successful match: 'ef'!
Successful match: 'fg'!
For your use case, the simplest solution would be to add two rules: one that matches words that start with a non-alphanumeric, non-whitespace character and extend to the next whitespace character, and one that matches words that end with a non-alphanumeric character that isn't a dot (or anything else that's allowed to appear directly after an e-mail):
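[^ \t\r\nA-Za-z0-9][^ \t\r\n]* ; /* word starting with an invalid character: ignore */
[^ \t\r\n]*[^ \t\r\nA-Za-z0-9.] ; /* word ending with an invalid character: ignore */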
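Put together with your e-mail rule, a complete flex file might look roughly like the sketch below. This is an illustration rather than a drop-in solution: the file name emails.l, the printf message and the catch-all rule are my own choices. I have also assumed a slightly simplified version of your pattern, with a single @ instead of [@]+ and without the trailing (.com)*, which the preceding group already covers.

```lex
%option noyywrap
%{
#include <stdio.h>
%}

%%
    /* E-mail rule: simplified from the question (assumption: exactly one @). */
[A-Za-z0-9]+([._-]*[A-Za-z0-9]+)*@[A-Za-z0-9]+([._-]*[A-Za-z0-9]+)*  printf("Valid email: '%s'\n", yytext);

    /* Word that starts with a non-alphanumeric character: swallow it whole. */
[^ \t\r\nA-Za-z0-9][^ \t\r\n]*    ;

    /* Word that ends with a non-alphanumeric character other than a dot. */
[^ \t\r\n]*[^ \t\r\nA-Za-z0-9.]   ;

    /* Anything else, including newlines: skip silently. */
.|\n                              ;
%%

int main(void)
{
    yylex();    /* reads from stdin until end of input */
    return 0;
}
```

Build it with flex emails.l && cc lex.yy.c -o emails. Given the three addresses from the question on separate lines, only ilike.stack@gmail should be reported. For .ilikestack@gmail.com the second rule matches the whole word, and for ilikestack@gmail.com_ the third rule's match is one character longer than the e-mail rule's, so in both cases the longest-match rule picks the silent action.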