How do I restrict what comes before and after a regex?


I have to create a regular expression to identify emails. Here's how it looks so far:

[A-Za-z0-9]+([._-]*[A-Za-z0-9]+)*[@]+[A-Za-z0-9]+([._-]*[A-Za-z0-9]+)*(.com)*

The catch is that an email can't start or end with a non-alphanumeric character. So:

.ilikestack@gmail.com or ilikestack@gmail.com_ = invalid
ilike.stack@gmail = valid

But when I run my Lex program, the first two emails above are considered valid, and I can't figure out how to change this.


Solution

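  • The usual way to control what can and can't appear before and after a regex is to define another regex, or multiple ones, which match the same thing but preceded or followed by invalid characters. Because lex always picks the rule that produces the longest match, the longer "reject" rule wins whenever an invalid character is adjacent.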

    So if we had the regex [a-z]+, but we only wanted it to match if it was preceded by only white space (or at the beginning of the file) and followed by only white space or a dot (or the end of file), we could accomplish that as follows:

    [a-z]+                printf("Successful match: '%s'!\n", yytext);
    [^a-z \t\r\n][a-z]+   ;
    [a-z]+[^a-z \t\r\n.]  ;
    .                     ;
    

    Then the input ab cd_ ef. .de fg would produce the output:

    Successful match: 'ab'!
    Successful match: 'ef'!
    Successful match: 'fg'!
    

    For your use case, the simplest solution would be to add two rules: one for words that start with a non-whitespace character that can't begin an email and extend to the next whitespace character, and one for words that end with a non-email character that isn't a dot (or anything else that's allowed to appear directly after an email):

    [^ \t\r\nA-Za-z0-9][^ \t\r\n]*   ;
    [^ \t\r\n]*[^ \t\r\nA-Za-z0-9.]  ;
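
    Put together with the email pattern from the question, a complete scanner might look like the sketch below. It assumes flex (for `%option noyywrap`). The `ALNUM` and `EMAIL` definitions, the file name `emails.l`, and the final catch-all rule are illustrative choices, not part of the rules above. The trailing `(.com)*` from the question is left out, since the group before it already matches `.com`.

    ```lex
    %option noyywrap

    %{
    /* Hypothetical build, assuming flex: flex emails.l && cc lex.yy.c -o emails */
    #include <stdio.h>
    %}

    ALNUM   [A-Za-z0-9]
    EMAIL   {ALNUM}+([._-]*{ALNUM}+)*[@]+{ALNUM}+([._-]*{ALNUM}+)*

    %%
    {EMAIL}                           printf("Successful match: '%s'!\n", yytext);
    [^ \t\r\nA-Za-z0-9][^ \t\r\n]*    ;  /* word that starts with an invalid character */
    [^ \t\r\n]*[^ \t\r\nA-Za-z0-9.]   ;  /* run that ends with an invalid, non-dot character */
    .|\n                              ;  /* skip everything else, one character at a time */
    %%

    int main(void)
    {
        yylex();   /* scan stdin until end of input */
        return 0;
    }
    ```

    Given the input .ilikestack@gmail.com ilikestack@gmail.com_ ilike.stack@gmail, this should report only ilike.stack@gmail. The first word is swallowed whole by the second rule, and for the second word the third rule matches one character more than the email rule does (the trailing _), so it takes precedence.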