java regex email-validation rfc5322

Email Id validation according to RFC5322 and https://en.wikipedia.org/wiki/Email_address


I am validating email IDs according to RFC 5322 and the rules described at

https://en.wikipedia.org/wiki/Email_address

Below is sample code that uses Java and a regular expression to validate email IDs.

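// Imports needed by this snippet
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public void checkValid() {
    List<String> emails = new ArrayList<>();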
    //Valid Email Ids
    emails.add("[email protected]");
    emails.add("[email protected]");                   
    emails.add("[email protected]");
    emails.add("[email protected]");
    emails.add("[email protected]");
    emails.add("[email protected]");
    emails.add("[email protected]");
    emails.add("[email protected]");
    emails.add("carlosd'[email protected]");
    emails.add("[email protected]");
    emails.add("admin@mailserver1");
    emails.add("[email protected]");
    emails.add("\" \"@example.org");
    emails.add("\"john..doe\"@example.org");

    //Invalid emails Ids
    emails.add("Abc.example.com");
    emails.add("A@b@[email protected]");
    emails.add("a\"b(c)d,e:f;g<h>i[j\\k][email protected]");
    emails.add("just\"not\"[email protected]");
    emails.add("this is\"not\\[email protected]");
    emails.add("this\\ still\"not\\[email protected]");
    emails.add("1234567890123456789012345678901234567890123456789012345678901234+x@example.com");
    emails.add("[email protected]");
    emails.add("[email protected]");

    String regex = "^[a-zA-Z0-9_!#$%&'*+/=? \\\"`{|}~^.-]+@[a-zA-Z0-9.-]+$";

    Pattern pattern = Pattern.compile(regex);
    int i=0;
    for(String email : emails){
        Matcher matcher = pattern.matcher(email);
        System.out.println(++i +"."+email +" : "+ matcher.matches());
    }
}

Actual Output:

   [email protected] : true
   [email protected] : true
   [email protected] : true
   [email protected] : true
   [email protected] : true
   [email protected] : true
   [email protected] : true
   [email protected] : true
   9.carlosd'[email protected] : true
   [email protected] : true
   11.admin@mailserver1 : true
   [email protected] : true
   13." "@example.org : true
   14."john..doe"@example.org : true
   15.Abc.example.com : false
   16.A@b@[email protected] : false
   17.a"b(c)d,e:f;g<h>i[j\k][email protected] : false
   18.just"not"[email protected] : true
   19.this is"not\[email protected] : false
   20.this\ still"not\[email protected] : false
   21.1234567890123456789012345678901234567890123456789012345678901234+x@example.com : true
   [email protected] : true
   [email protected] : true

Expected Output:

[email protected] : true
[email protected] : true
[email protected] : true
[email protected] : true
[email protected] : true
[email protected] : true
[email protected] : true
[email protected] : true
9.carlosd'[email protected] : true
[email protected] : true
11.admin@mailserver1 : true
[email protected] : true
13." "@example.org : true
14."john..doe"@example.org : true
15.Abc.example.com : false
16.A@b@[email protected] : false
17.a"b(c)d,e:f;g<h>i[j\k][email protected] : false
18.just"not"[email protected] : false
19.this is"not\[email protected] : false
20.this\ still"not\[email protected] : false
21.1234567890123456789012345678901234567890123456789012345678901234+x@example.com : false
[email protected] : false
[email protected] : false

How can I change my regular expression so that it rejects the following email IDs?

1234567890123456789012345678901234567890123456789012345678901234+x@example.com
[email protected]
[email protected] 
just"not"[email protected]

Below are the criteria the regular expression should satisfy:

Local-part

The local-part of the email address may use any of these ASCII characters:

  1. uppercase and lowercase Latin letters A to Z and a to z;
  2. digits 0 to 9;
  3. special characters !#$%&'*+-/=?^_`{|}~
  4. dot ., provided that it is not the first or last character unless quoted, and provided also that it does not appear consecutively unless quoted (e.g. [email protected] is not allowed but "John..Doe"@example.com is allowed);
  5. space and "(),:;<>@[\] characters are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash); comments are allowed with parentheses at either end of the local-part; e.g. john.smith(comment)@example.com and (comment)[email protected] are both equivalent to [email protected].
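
Rule 4 (the dot rule for an unquoted local-part) can be sketched on its own as runs of the allowed characters separated by single dots. This is only an illustration of that one rule, not a full RFC 5322 check: it ignores quoted strings, comments and length limits, and the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

class LocalPartDotRule {
    // One run of the characters from rules 1-3 (letters, digits, listed specials).
    private static final String ATEXT = "[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+";

    // Runs separated by single dots: no leading, trailing, or doubled dot.
    private static final Pattern DOT_ATOM =
            Pattern.compile(ATEXT + "(?:\\." + ATEXT + ")*");

    // Hypothetical helper: checks only an unquoted local-part (the text before '@').
    static boolean isUnquotedLocalPart(String localPart) {
        return DOT_ATOM.matcher(localPart).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUnquotedLocalPart("John.Doe"));   // true
        System.out.println(isUnquotedLocalPart("John..Doe"));  // false (doubled dot)
        System.out.println(isUnquotedLocalPart(".John"));      // false (leading dot)
        System.out.println(isUnquotedLocalPart("John."));      // false (trailing dot)
    }
}
```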

Domain

  1. uppercase and lowercase Latin letters A to Z and a to z;
  2. digits 0 to 9, provided that top-level domain names are not all-numeric;
  3. hyphen -, provided that it is not the first or last character. Comments are allowed in the domain as well as in the local-part; for example, john.smith@(comment)example.com and [email protected](comment) are equivalent to [email protected].
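
The three domain rules can be sketched the same way. The version below assumes the hyphen rule applies to each dot-separated label (my reading of the Wikipedia text), does not handle comments or IP-address literals, and uses hypothetical names throughout.

```java
class DomainRuleSketch {
    // A label: letters/digits, with hyphens allowed only in the interior.
    private static final String LABEL = "[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?";

    // Hypothetical helper: applies rules 1-3 to the text after '@'.
    static boolean isPlausibleDomain(String domain) {
        // limit -1 keeps empty trailing labels so "example.com." is rejected
        String[] labels = domain.split("\\.", -1);
        for (String label : labels) {
            if (!label.matches(LABEL)) {
                return false;
            }
        }
        // Rule 2: the top-level label must not be all digits.
        return !labels[labels.length - 1].matches("[0-9]+");
    }

    public static void main(String[] args) {
        System.out.println(isPlausibleDomain("example.com"));   // true
        System.out.println(isPlausibleDomain("mailserver1"));   // true
        System.out.println(isPlausibleDomain("-example.com"));  // false
        System.out.println(isPlausibleDomain("example.123"));   // false
    }
}
```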

Solution

  • You could validate against RFC 5322 like this
    (modified from a reference regex):

    "(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)|((?:[0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*)?[0-9a-z]@))(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\\w]*))$"  
    

    https://regex101.com/r/ObS3QZ/1

     # (?im)^(?=.{1,64}@)(?:("[^"\\]*(?:\\.[^"\\]*)*"@)|((?:[0-9a-z](?:\.(?!\.)|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)?[0-9a-z]@))(?=.{1,255}$)(?:(\[(?:\d{1,3}\.){3}\d{1,3}\])|((?:(?=.{1,63}\.)[0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\w]*))$
    
     # Note - remove all comments '(comments)' before running this regex
     # Find  \([^)]*\)  replace with nothing
    
     (?im)                                     # Case-insensitive, multiline
     ^                                         # BOS
    
                                               # Local part
     (?= .{1,64} @ )                           # 64 max chars
     (?:
          (                                         # (1 start), Quoted
               " [^"\\]* 
               (?: \\ . [^"\\]* )*
               "
               @
          )                                         # (1 end)
       |                                          # or, 
          (                                         # (2 start), Non-quoted
               (?:
                    [0-9a-z] 
                    (?:
                         \.
                         (?! \. )
                      |                                          # or, 
                         [-!#\$%&'\*\+/=\?\^`\{\}\|~\w] 
                    )*
               )?
               [0-9a-z] 
               @
          )                                         # (2 end)
     )
                                               # Domain part
     (?= .{1,255} $ )                          # 255 max chars
     (?:
          (                                         # (3 start), IP
               \[
               (?: \d{1,3} \. ){3}
               \d{1,3} \]
          )                                         # (3 end)
       |                                          # or,   
          (                                         # (4 start), Others
               (?:                                       # Labels (63 max chars each)
                    (?= .{1,63} \. )
                    [0-9a-z] [-\w]* [0-9a-z]* 
                    \.
               )+
               [a-z0-9] [\-a-z0-9]{0,22} [a-z0-9] 
          )                                         # (4 end)
       |                                          # or,
          (                                         # (5 start), Localdomain
               (?= .{1,63} $ )
               [0-9a-z] [-\w]* 
          )                                         # (5 end)
     )
     $                                         # EOS
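
For what it's worth, here is a rough sketch of how the regex above could be wired into Java, with the comment-stripping step from the note applied first. The class and method names are made up for the example, the regex string is only split in two for readability, and the sample addresses are taken from the question.

```java
import java.util.regex.Pattern;

class Rfc5322Sketch {
    // Local part of the regex above (quoted or non-quoted), up to and including '@'.
    static final String LOCAL =
        "(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)|((?:[0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*)?[0-9a-z]@))";

    // Domain part of the regex above (IP literal, dotted labels, or a local domain).
    static final String DOMAIN =
        "(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\\w]*))$";

    static final Pattern EMAIL = Pattern.compile(LOCAL + DOMAIN);

    // Hypothetical helper: strips "(comment)" runs, then applies the regex.
    static boolean isValid(String email) {
        String stripped = email.replaceAll("\\([^)]*\\)", "");
        return EMAIL.matcher(stripped).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "admin@mailserver1",                       // true
            "\"john..doe\"@example.org",               // true
            "john.smith(comment)@example.com",         // true once the comment is stripped
            "Abc.example.com",                         // false, no '@'
            "just\"not\"right@example.com",            // false, quote inside unquoted local part
            "1234567890123456789012345678901234567890123456789012345678901234+x@example.com" // false, local part > 64
        };
        for (String s : samples) {
            System.out.println(s + " : " + isValid(s));
        }
    }
}
```

One caveat with this sketch: the blanket replaceAll also removes parenthesised text inside a quoted local part, so a stricter version would need to skip quoted strings when stripping comments.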
    

    How can I make [email protected] a valid email ID? – Mihir Feb 7 at 9:34

    I think the spec wants the local part either to be enclosed in quotes
    or to begin and end with [0-9a-z].

    To get around the latter and make [email protected] valid, just
    replace group 2 with this:

          (                             # (2 start), Non-quoted
               [0-9a-z] 
               (?:
                    \.
                    (?! \. )
                 |                              # or, 
                    [-!#\$%&'\*\+/=\?\^`\{\}\|~\w] 
               )*
               @
    
          )                             # (2 end)
    

    New regex

    "(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)|([0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*@))(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\\w]*))$"
    

    New demo

    https://regex101.com/r/ObS3QZ/5
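
To see what replacing group 2 changes, the sketch below builds both variants from the same pieces and runs them side by side. The class name, helper names and the address user-@example.com are made up for illustration; they are not taken from the question.

```java
import java.util.regex.Pattern;

class Group2Comparison {
    static final String QUOTED = "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)";

    // Original group 2: must also end with [0-9a-z] before '@'.
    static final String STRICT =
        "((?:[0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*)?[0-9a-z]@)";

    // Replacement group 2: only the first character is restricted.
    static final String RELAXED =
        "([0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*@)";

    static final String DOMAIN =
        "(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\\w]*))$";

    // Assembles the full regex around the chosen non-quoted group.
    static Pattern build(String group2) {
        return Pattern.compile("(?im)^(?=.{1,64}@)(?:" + QUOTED + "|" + group2 + ")" + DOMAIN);
    }

    static final Pattern ORIGINAL = build(STRICT);
    static final Pattern UPDATED = build(RELAXED);

    static boolean matchesOriginal(String email) { return ORIGINAL.matcher(email).matches(); }
    static boolean matchesUpdated(String email)  { return UPDATED.matcher(email).matches(); }

    public static void main(String[] args) {
        String sample = "user-@example.com"; // hypothetical address ending in a special character
        System.out.println(matchesOriginal(sample)); // false
        System.out.println(matchesUpdated(sample));  // true
    }
}
```

As written, the relaxed group also accepts a single dot directly before the @ (for example john.@example.com), which the dot rule in the question would reject, so it may need a further tweak depending on how strict you want to be.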