Fully RFC 5321- and RFC 5322-compatible PHP PCRE regex


I'm trying to create a PHP PCRE regex that is (almost) fully compatible with RFC 5321 and RFC 5322 to test email addresses. The only thing I don't require is the (comment) part. I've seen some other attempts at this posted here, but when I run my tests against them, they don't all pass.

I have been working on one that is very close:

 ^(([\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64})|("[\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64}"))@(([\w\-]*\.?[\w\-]*)|(\[\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\])|(\[IPv6:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}\]))$

To break it down:

Local part:

(

Match 1 to 64 of the allowed characters:

   ([\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64})
    |

OR match the same set of characters in a quoted string:

   ("[\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64}")
)

end local part.

match @ sign

@

match domain part:

(

match domain part using allowed characters:

   ([\w\-]*\.?[\w\-]*)

or an IPv4 address literal (it doesn't check that each octet is at most 255; that would be handled elsewhere)

   (\[\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\])

or an IPv6 address literal

   (\[IPv6:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}\])

)
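
For the range check that the IPv4 branch leaves to other code, one option is PHP's built-in filter_var() with FILTER_VALIDATE_IP. The sketch below is only an illustration: is_valid_address_literal() is a hypothetical helper name, and it assumes the domain has already been split off and arrives in bracketed form. Since the dots between the digit groups in the pattern above are unescaped, they match any character, so a check like this would also catch input such as [1a2b3c4].

```php
<?php
/**
 * Hypothetical helper (not part of the original regex): validates a
 * bracketed address literal such as "[192.168.0.1]" or "[IPv6:::1]".
 */
function is_valid_address_literal(string $domain): bool
{
    // Group 1 is the optional "IPv6:" tag, group 2 the address itself.
    if (preg_match('/^\[(IPv6:)?(.+)\]$/', $domain, $m) !== 1) {
        return false; // not a bracketed literal at all
    }

    // Pick the address family from the tag, then let filter_var()
    // do the actual syntax and range checking.
    $flag = $m[1] !== '' ? FILTER_FLAG_IPV6 : FILTER_FLAG_IPV4;

    return filter_var($m[2], FILTER_VALIDATE_IP, $flag) !== false;
}

var_dump(is_valid_address_literal('[192.168.0.1]')); // in-range IPv4 literal
var_dump(is_valid_address_literal('[300.1.1.1]'));   // octet out of range
var_dump(is_valid_address_literal('[IPv6:::1]'));    // tagged IPv6 literal
```

Delegating to filter_var() keeps the regex responsible only for the overall shape of the address.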

The only thing it's missing is a check for multiple consecutive periods outside a quoted local-part. I tested it on regex101.com against all the addresses below, which combine my own tests with the examples from the Wikipedia article on email addresses:

bob@smith.com
bob.smith@smith.com
bob-smith@smith.com
bob-smith@bob-smith.com
b0b!-...smith@smith.com <-DOES NOT VALIDATE CORRECTLY - MULTIPLE .'s
bob&smith@smith.com
"bob..smith"@smith.com

simple@example.com
very.common@example.com
disposable.style.email.with+symbol@example.com
other.email-with-hyphen@example.com
fully-qualified-domain@example.com
user.name+tag+sorting@example.com
x@example.com
example-indeed@strange-example.com
admin@mailserver1
example@s.example
" "@example.org
"john..doe"@example.org

Abc.example.com
A@b@c@example.com
a"b(c)d,e:f;g<h>i[j\k]l@example.com
just"not"right@example.com
this is"not\allowed@example.com
this\ still\"not\\allowed@example.com
1234567890123456789012345678901234567890123456789012345678901234+x@example.com
john..doe@example.com  <-DOES NOT VALIDATE CORRECTLY - MULTIPLE .'s
john.doe@example..com
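
A test run like this could also be scripted in PHP instead of being done by hand on regex101.com. The sketch below is a minimal harness under stated assumptions: matches_pattern() is a hypothetical helper, the expected verdicts are taken from the lists above, and only a handful of the addresses are included to keep it short. The pattern is copied verbatim from the question, and a nowdoc is used so that PHP does not reinterpret any backslashes or quotes.

```php
<?php
// Pattern from the question, wrapped in "/" delimiters.
$pattern = <<<'REGEX'
/^(([\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64})|("[\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64}"))@(([\w\-]*\.?[\w\-]*)|(\[\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\])|(\[IPv6:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}\]))$/
REGEX;

/** Hypothetical helper: true when $address matches $pattern. */
function matches_pattern(string $pattern, string $address): bool
{
    return preg_match($pattern, $address) === 1;
}

// address => verdict the pattern is supposed to give (true = accept)
$cases = [
    'bob@smith.com'           => true,
    '"john..doe"@example.org' => true,
    'admin@mailserver1'       => true,
    'Abc.example.com'         => false,
    'A@b@c@example.com'       => false,
    'john.doe@example..com'   => false,
    'john..doe@example.com'   => false,
    'b0b!-...smith@smith.com' => false,
];

foreach ($cases as $address => $expected) {
    $actual = matches_pattern($pattern, $address);
    printf(
        "%-26s expected=%-5s actual=%-5s%s\n",
        $address,
        var_export($expected, true),
        var_export($actual, true),
        $actual === $expected ? '' : ' <- mismatch'
    );
}
```

With this pattern, the only mismatches in that subset are the two addresses with consecutive periods in an unquoted local-part, which is the gap described above.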

I attempted to use lookahead and lookbehind assertions to test for the consecutive periods, but I couldn't figure it out. I think that's the only thing it's missing (other than the comments, which for my purposes aren't required).

Is there a way to check for the periods that wouldn't alter what I currently have too much, or would it require a different approach?

Please let me know if I missed anything else.

Thank you.


Solution

  • You may add (?!("[^"]*"|[^"])*\.{2}) after ^.

    See the regex demo.

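    The (?!("[^"]*"|[^"])*\.{2}) negative lookahead fails the match if, immediately to the right of the current location, there are

    • ("[^"]*"|[^"])* - zero or more occurrences of either a quoted run (a ", then zero or more characters other than ", then a closing ") or any single character other than ", followed by
    • \.{2} - two consecutive dots.

    Because a quoted run is consumed as a whole, dots inside a quoted local-part are skipped over and never trigger the check.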
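
As a rough illustration of where the lookahead goes, the sketch below rebuilds the question's pattern in PHP with the lookahead placed right after ^. The variable names and the accepts() helper are hypothetical; the pattern body is the question's regex without its anchors.

```php
<?php
// Lookahead from the answer: reject if ".." occurs outside a quoted run.
$noDoubleDot = '(?!("[^"]*"|[^"])*\.{2})';

// Body of the question's pattern, without the ^ and $ anchors.
$body = <<<'REGEX'
(([\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64})|("[\w \!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\|\}\~\.]{1,64}"))@(([\w\-]*\.?[\w\-]*)|(\[\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\])|(\[IPv6:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}:[\da-fA-F]{0,4}\]))
REGEX;

// The lookahead sits immediately after the start anchor.
$pattern = '/^' . $noDoubleDot . $body . '$/';

/** Hypothetical helper: true when $address matches $pattern. */
function accepts(string $pattern, string $address): bool
{
    return preg_match($pattern, $address) === 1;
}

foreach ([
    'bob.smith@smith.com',      // single dots only
    '"john..doe"@example.org',  // double dot inside quotes
    'john..doe@example.com',    // double dot outside quotes
    'b0b!-...smith@smith.com',  // triple dot outside quotes
] as $address) {
    printf("%-26s %s\n", $address, accepts($pattern, $address) ? 'accepted' : 'rejected');
}
```

Two points to keep in mind with this approach: the lookahead scans the whole string, so it also covers the domain part, and its group is a capturing one, so the numbers of the existing groups shift up by one. Writing it as (?!(?:"[^"]*"|[^"])*\.{2}) should avoid that shift if the group numbers matter.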