ruby-on-rails regex testing ruby-on-rails-4 railstutorial.org

rails email validation format and regex


I'm currently following Michael Hartl's Rails tutorial.

Given the following tests in Rails:

  test "email validation should accept valid addresses" do
    valid_addresses = %w[user@example.com USER@foo.COM A_US-ER@foo.bar.org
                         first.last@foo.jp alice+bob@baz.cn]
    valid_addresses.each do |valid_address|
      @user.email = valid_address
      assert @user.valid?, "#{valid_address.inspect} should be valid"
    end
  end

  test "email validation should reject invalid addresses" do
    invalid_addresses = %w[user@example,com user_at_foo.org user.name@example.
                           foo@bar_baz.com foo@bar+baz.com]
    invalid_addresses.each do |invalid_address|
      @user.email = invalid_address
      assert_not @user.valid?, "#{invalid_address.inspect} should be invalid"
    end
  end

and the following regex for email format validation:

VALID_EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i
validates :email, presence: true, format: { with: VALID_EMAIL_REGEX }

Can someone explain what the tests are checking with respect to the regex? Why are the valid examples limited to user@example.com, USER@foo.COM, and so on? What if I add another element to valid_addresses, such as USER@EXAMPLE.COM? Why did Michael choose these 5 specific addresses for valid_addresses and these 5 for invalid_addresses?

If the regex already checks every format and only accepts a specific one, why do we need to test at all?


Solution

  • Let us break down the expression (keep in mind that the i modifier makes it case-insensitive):

    \A          (?# anchor to the beginning of the string)
    [\w+\-.]+   (?# match 1+ a-z, A-Z, 0-9, +, _, -, or .)
    @           (?# match literal @)
    [a-z\d\-.]+ (?# match 1+ a-z, 0-9, -, or .)
    \.          (?# match literal .)
    [a-z]+      (?# match 1+ a-z)
    \z          (?# anchor to the absolute end of the string)
    

    This is what the tutorial defines as an email address (in reality, the rules are much more complicated). So the author, Michael Hartl, wrote a couple of tests covering "valid" and "invalid" addresses according to the definition above.
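
    The same breakdown can also be written directly in Ruby with the extended (x) flag, which ignores whitespace outside character classes and allows # comments inside the pattern. This is only a sketch: the constant name COMMENTED_EMAIL_REGEX and the sample list are made up for illustration and do not come from the tutorial.

    ```ruby
    # The tutorial's one-line pattern, copied from the question.
    VALID_EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i

    # The same pattern in extended mode (x flag). COMMENTED_EMAIL_REGEX is a
    # hypothetical name used only for this illustration.
    COMMENTED_EMAIL_REGEX = /
      \A            # anchor to the beginning of the string
      [\w+\-.]+     # user: one or more word characters, plus, hyphen, or dot
      @             # literal at sign
      [a-z\d\-.]+   # domain: one or more letters, digits, hyphens, or dots
      \.            # literal dot
      [a-z]+        # TLD: one or more letters
      \z            # anchor to the absolute end of the string
    /ix

    # Compare both patterns on a few addresses taken from the question's tests.
    samples = %w[user@example.com USER@foo.COM foo@bar_baz.com user_at_foo.org]
    samples.each do |address|
      same = VALID_EMAIL_REGEX.match?(address) == COMMENTED_EMAIL_REGEX.match?(address)
      puts "#{address}: both patterns agree? #{same}"
    end
    ```

    Both constants should accept and reject exactly the same strings; the extended form only changes how the pattern reads.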

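    In short, the "user" part can contain letters, digits, and the characters _, +, -, and . (dot). The "domain" can contain letters, digits, - and . (dot). The "TLD" can contain only letters. The first 5 addresses exercise different combinations of these rules that should be accepted: uppercase letters, an underscore and a hyphen in the user part, a multi-part domain, a dot in the user part, and a plus sign. Because of the i modifier, USER@EXAMPLE.COM would be accepted as well. The last 5 addresses fail for the following reasons: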

    • user@example,com - the , is not in any character class, so it can't be matched
    • user_at_foo.org - no @
    • user.name@example. - no TLD after .
    • foo@bar_baz.com - domain can't contain _
    • foo@bar+baz.com - domain can't contain +
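
    The reasons above can be checked outside Rails by running the regex over both lists from the question. This is a minimal sketch in plain Ruby: it uses Regexp#match? (Ruby 2.4 or later) instead of the model validation, so it only exercises the pattern itself, not the presence check.

    ```ruby
    # Pattern and address lists copied from the question.
    VALID_EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i

    valid_addresses = %w[user@example.com USER@foo.COM A_US-ER@foo.bar.org
                         first.last@foo.jp alice+bob@baz.cn]
    invalid_addresses = %w[user@example,com user_at_foo.org user.name@example.
                           foo@bar_baz.com foo@bar+baz.com]

    # Print whether each address matches the pattern.
    (valid_addresses + invalid_addresses).each do |address|
      verdict = VALID_EMAIL_REGEX.match?(address) ? "matches" : "does not match"
      puts "#{address.inspect} #{verdict}"
    end
    ```

    The first five lines of output should say "matches" and the last five "does not match", which is the same split the two tests assert through @user.valid?.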

    If you want other specific addresses to match (or not match), add them to the arrays in the tests. If a test then fails, you know you need to update your expression :)
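
    As a sketch of that workflow, here are two extra cases tried against the regex in plain Ruby. Both addresses and the array names extra_valid and extra_invalid are hypothetical additions, not part of the tutorial's lists. USER@EXAMPLE.COM is the address from the question; foo@bar..com is an address with two consecutive dots that one might reasonably want rejected.

    ```ruby
    # Pattern copied from the question.
    VALID_EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i

    # Hypothetical extra cases, not part of the tutorial's lists.
    extra_valid   = %w[USER@EXAMPLE.COM]
    extra_invalid = %w[foo@bar..com]

    # An address expected to be valid should match the pattern.
    extra_valid.each do |address|
      status = VALID_EMAIL_REGEX.match?(address) ? "ok" : "FAIL: should be valid"
      puts "#{address.inspect} #{status}"
    end

    # An address expected to be invalid should not match the pattern.
    extra_invalid.each do |address|
      status = VALID_EMAIL_REGEX.match?(address) ? "FAIL: should be invalid" : "ok"
      puts "#{address.inspect} #{status}"
    end
    ```

    USER@EXAMPLE.COM passes because of the i modifier. foo@bar..com is reported as a failure, because the domain character class includes the dot and so lets "bar." through before the final literal dot. A failing case like this is the signal that the expression needs tightening if you care about it.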