Regex is blowing my mind. How can I change this to validate emails with a plus sign, so I can sign up with [email protected]?
if(!preg_match("/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$/i", $_GET['em'])) {
It seems like you aren't really familiar with what your regex is currently doing; understanding that is a good first step before modifying it. Let's walk through your regex using the email address john.robert.smith@mail.com
(after each section below, the part of the address matched by that section is given in parentheses):
1. ^ is the start of string anchor. It specifies that any match must begin at the beginning of the string. If the pattern is not anchored, the regex engine can match a substring, which is often undesired.
   Anchors are zero-width, meaning that they do not consume any characters.
2. [_a-z0-9-]+ is made up of two elements, a character class and a repetition modifier:
   - [...] defines a character class, which tells the regex engine that any of these characters is a valid match. In this case the class contains the letters a-z, the digits 0-9, the dash and the underscore. (In general, a dash in a character class defines a range, so you can use a-z instead of abcdefghijklmnopqrstuvwxyz; when given as the last character in the class, it acts as a literal dash.)
   - + is a repetition modifier that specifies that the preceding token (in this case, the character class) can be repeated one or more times. There are two other repetition operators: * matches zero or more times; ? matches zero or one time (i.e. makes something optional).
   (matches "john" in john.robert.smith@mail.com)
3. (\.[_a-z0-9-]+)* again contains a repeated character class. It also contains a group and an escaped character:
   - (...) defines a group, which allows you to group multiple tokens together (in this case, the group will be repeated as a whole). In a pattern like abc*, the repetition modifier would only apply to the c, because c is the last token before the modifier. To get around this, we can group abc, as in (abc)*, in which case the modifier applies to the entire group, as if it were a single token.
   - \. specifies a literal dot character. This is needed because . is a special character in regex that matches any character (by default, any character except a newline). Since we want to match an actual dot character, we need to escape it.
   (matches ".robert.smith" in john.robert.smith@mail.com)
4. @ is not a special character in regex, so, like all other non-special characters, it matches literally.
   (matches "@" in john.robert.smith@mail.com)
5. [a-z0-9-]+ again defines a repeated character class, like item #2 above.
   (matches "mail" in john.robert.smith@mail.com)
6. (\.[a-z0-9-]+)* is almost exactly the same pattern as #3 above; the only difference is that the character class does not include the underscore.
   (matches ".com" in john.robert.smith@mail.com)
7. $ is the end of string anchor. It works the same as ^ above, except that it matches the end of the string.
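To check the walkthrough above, here is a small sketch that runs the original pattern against john.robert.smith@mail.com. The named groups (first, dotted, host, rest) are not part of your regex; they are added here only so each section's match can be printed. The short checks at the end illustrate anchoring, grouping and escaping with made-up sample strings.

```php
<?php
// Original pattern, with each section wrapped in a named group.
// The group names are invented for this illustration.
$pattern = '/^(?<first>[_a-z0-9-]+)'      // start anchor + first character class
         . '(?<dotted>(\.[_a-z0-9-]+)*)'  // repeated ".something" group
         . '@'                            // literal @
         . '(?<host>[a-z0-9-]+)'          // first part of the domain
         . '(?<rest>(\.[a-z0-9-]+)*)$/i'; // remaining domain parts + end anchor

preg_match($pattern, 'john.robert.smith@mail.com', $m);

echo $m['first'], "\n";   // john
echo $m['dotted'], "\n";  // .robert.smith
echo $m['host'], "\n";    // mail
echo $m['rest'], "\n";    // .com

// Anchors: without ^ and $, matching a substring is enough.
var_dump(preg_match('/[a-z]+/', '123abc456'));   // int(1)
var_dump(preg_match('/^[a-z]+$/', '123abc456')); // int(0)

// Groups: the modifier applies to the last token, or to the whole group.
var_dump(preg_match('/^abc*$/', 'abcabc'));   // int(0)
var_dump(preg_match('/^(abc)*$/', 'abcabc')); // int(1)

// Escaping: an unescaped dot matches any character, an escaped one only a dot.
var_dump(preg_match('/^a.c$/', 'abc'));  // int(1)
var_dump(preg_match('/^a\.c$/', 'abc')); // int(0)
```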
With that in mind, it should be a bit clearer how to add a section that matches a plus segment. As we saw above, + is a special character, so it has to be escaped. Then, since the + has to be followed by some characters, we can define a character class with the characters we want to match and give it a repetition modifier. Finally, we should make the whole group optional, because email addresses don't need to have a + segment:
(\+[a-z0-9-]+)?
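As a quick check of that fragment by itself, the sketch below wraps it in ^ and $ (added here only so the fragment can be tested in isolation) and tries a few made-up inputs. The variable name $plus is just for this example.

```php
<?php
// The plus-segment fragment, anchored so it can be tested on its own.
$plus = '/^(\+[a-z0-9-]+)?$/i';

var_dump(preg_match($plus, '+news')); // int(1): a plus followed by characters
var_dump(preg_match($plus, ''));      // int(1): the whole group is optional
var_dump(preg_match($plus, '+'));     // int(0): the plus needs at least one character after it
var_dump(preg_match($plus, 'news'));  // int(0): no leading plus
```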
When inserted into your regex, it'd look like this:
/^[_a-z0-9-]+(\.[_a-z0-9-]+)*(\+[a-z0-9-]+)?@[a-z0-9-]+(\.[a-z0-9-]+)*$/i
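Here is a minimal sketch of how the modified pattern could be tried out. The helper name isValidEmail and the sample addresses are assumptions made for this example, not something from your code.

```php
<?php
// Hypothetical helper wrapping the modified pattern; the name is illustrative.
function isValidEmail(string $email): bool
{
    $pattern = '/^[_a-z0-9-]+(\.[_a-z0-9-]+)*(\+[a-z0-9-]+)?@[a-z0-9-]+(\.[a-z0-9-]+)*$/i';
    return preg_match($pattern, $email) === 1;
}

var_dump(isValidEmail('john.smith+news@mail.com')); // bool(true)
var_dump(isValidEmail('john.smith@mail.com'));      // bool(true): plus segment is optional
var_dump(isValidEmail('john.smith+@mail.com'));     // bool(false): nothing after the plus
var_dump(isValidEmail('john+a+b@mail.com'));        // bool(false): only one plus segment allowed
var_dump(isValidEmail('john+a.b@mail.com'));        // bool(false): no dots inside the plus segment
```

Note that the plus segment as written only allows letters, digits and dashes, and only one such segment directly before the @. If you want dots or underscores after the plus, the character class in that group would need to be extended.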