
How to validate e-mail address with C++ using CAtlRegExp


I need to validate international email addresses in various formats in C++. Many of the answers I found online don't cut it, so I'm sharing a solution that works well for me, for anyone who is using the ATL Server Library.

Some background: I started with the post "Using a regular expression to validate an email address", which pointed to http://emailregex.com/. That site lists regular expressions, in various languages, that it says support RFC 5322, the official standard for the Internet Message Format.

The regular expression provided is

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
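For readers who want to try the pattern without ATL, here is a minimal sketch using the standard library's `std::regex` instead of CAtlRegExp. It assumes a reduced form of the expression: unquoted user names only, and a domain that is either a host name or a bracketed IPv4 literal. The function name `looksLikeEmail` is hypothetical and not part of the original post.

```cpp
#include <regex>
#include <string>

// Reduced form of the emailregex.com pattern: a dot-separated user name,
// followed by either a host name or a bracketed IPv4 literal.
// Quoted user names and the general address-literal branch are left out.
bool looksLikeEmail(const std::string& address)
{
    static const std::regex pattern(
        // user name: runs of allowed characters separated by single dots
        R"re([a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*)re"
        // host name: labels that start and end with a letter or digit
        R"re(@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)re"
        // or an IPv4 literal in square brackets
        R"re(|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3})re"
        R"re((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\]))re",
        std::regex::ECMAScript | std::regex::icase);

    // regex_match requires the whole string to match, so no ^ and $ are needed.
    return std::regex_match(address, pattern);
}
```

Calling `looksLikeEmail("david.jones@proseware.com")` should accept the address, while `looksLikeEmail("j..s@proseware.com")` should reject it because of the doubled dot.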

I'm using C++ with the ATL Server Library, which used to be part of Visual Studio. Microsoft has since released it on CodePlex as open source, and we still use it for some of its template libraries. My goal is to modify this regular expression so that it works with CAtlRegExp.


Solution

  • The regular expression engine (CAtlRegExp) in ATL is pretty basic. I was able to modify the regular expression as follows:

    ^{([a-z0-9!#$%&'*+/=?^_`{|}~\-]+(\.([a-z0-9!#$%&'*+/=?^_`{|}~\-]+))*)@((([a-z0-9]([a-z0-9\-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9\-]*[a-z0-9])?)|(\[(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\.)(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\.)(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\.)((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\]))}$
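
    Much of the length of the adapted expression comes from the IPv4 literal: the octet group is written out four times instead of using a counted repetition such as `{3}`, since braces in the adapted expression mark the match group rather than a repeat count. The sketch below is only an illustration of that structure. It rebuilds the bracketed IPv4 part from a single octet pattern and checks it with `std::regex` rather than CAtlRegExp; the names `kOctet`, `buildIpLiteralPattern` and `isIpLiteral` are hypothetical.

    ```cpp
    #include <regex>
    #include <string>

    // One IPv4 octet (0-255, no leading zeros), as written in the adapted expression.
    const std::string kOctet =
        "((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))";

    // Spell out "octet, dot" three times, then a final octet, all inside
    // literal square brackets. This mirrors the hand-expanded form above.
    std::string buildIpLiteralPattern()
    {
        std::string pattern = "(\\[";
        for (int i = 0; i < 3; ++i)
            pattern += "(" + kOctet + "\\.)";
        pattern += kOctet + "\\])";
        return pattern;
    }

    // True if text is a bracketed dotted-quad such as "[129.126.118.1]".
    bool isIpLiteral(const std::string& text)
    {
        static const std::regex re(buildIpLiteralPattern());
        return std::regex_match(text, re);
    }
    ```

    Building the pattern from one octet string makes it easier to see that each octet is limited to 0 through 255 and that leading zeros are not accepted.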

    The only thing that appears to be lost is Unicode support in domain names, which I was able to solve by following the C# example in the MSDN article "How to: Verify that Strings Are in Valid Email Format" and using IdnToAscii.

    In this approach the user name and domain name are extracted from the email address. The domain name is converted to ASCII using IdnToAscii, the two parts are put back together, and the result is run through the regular expression.

    Error handling is kept to a minimum for readability. The function uses fixed-size buffers, so it rejects any address that does not fit in 255 characters rather than risk a buffer overrun, and it returns false if either regular expression fails to parse.

    Code:

    #include <Windows.h>   // IdnToAscii (link with Normaliz.lib)
    #include <atlrx.h>     // CAtlRegExp, CAtlREMatchContext

    // Assumes a UNICODE build, so that CAtlRegExp<> works on wchar_t strings.
    bool WINAPI LocalLooksLikeEmailAddress(LPCWSTR lpszEmailAddress) 
    {
        const int ccbEmailAddressMaxLen = 255 ;
        // Reject NULL and anything too long for the fixed-size buffers below.
        if (lpszEmailAddress == NULL || ::wcslen(lpszEmailAddress) >= (size_t)ccbEmailAddressMaxLen)
            return false ;
    
        ATL::CAtlRegExp<> regexp ;
        ATL::CAtlREMatchContext<> regexpMatch ;
    
        // Step 1: split the address into user name and domain name.
        ATL::REParseError status = regexp.Parse(L"^{.+}@{.+}$", FALSE) ;
        if (status != REPARSE_ERROR_OK)
            return false ;
        if (!regexp.Match(lpszEmailAddress, &regexpMatch) || regexpMatch.m_uNumGroups != 2)
            return false ;
    
        const CAtlREMatchContext<>::RECHAR* szStart = 0 ;
        const CAtlREMatchContext<>::RECHAR* szEnd   = 0 ;
        wchar_t achANSIEmailAddress[ccbEmailAddressMaxLen] = { L'\0' } ;
        regexpMatch.GetMatch(0, &szStart, &szEnd) ;
        ::wcsncpy_s(achANSIEmailAddress, szStart, (size_t)(szEnd - szStart)) ;
        wchar_t achDomainName[ccbEmailAddressMaxLen] = { L'\0' } ;
        regexpMatch.GetMatch(1, &szStart, &szEnd) ;
        ::wcsncpy_s(achDomainName, szStart, (size_t)(szEnd - szStart)) ;
    
        // Step 2: convert the domain name to its ASCII (Punycode) form.
        wchar_t achPunycode[ccbEmailAddressMaxLen] = { L'\0' } ;
        if (IdnToAscii(0, achDomainName, -1, achPunycode, ccbEmailAddressMaxLen) == 0)
            return false ;
    
        // Step 3: put the two parts back together, if the result still fits.
        if (::wcslen(achANSIEmailAddress) + 1 + ::wcslen(achPunycode) >= (size_t)ccbEmailAddressMaxLen)
            return false ;
        ::wcscat_s(achANSIEmailAddress, L"@") ;
        ::wcscat_s(achANSIEmailAddress, achPunycode) ;
    
        // Step 4: run the ASCII form through the adapted regular expression.
        status = regexp.Parse(
            L"^{([a-z0-9!#$%&'*+/=?^_`{|}~\\-]+(\\.([a-z0-9!#$%&'*+/=?^_`{|}~\\-]+))*)@((([a-z0-9]([a-z0-9\\-]*[a-z0-9])?\\.)+[a-z0-9]([a-z0-9\\-]*[a-z0-9])?)|(\\[(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\\.)(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\\.)(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\\.)((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\\]))}$"
            , FALSE) ;
        if (status != REPARSE_ERROR_OK)
            return false ;
    
        return regexp.Match(achANSIEmailAddress, &regexpMatch) != 0 ;
    }
    

    One thing worth mentioning is that this approach did not agree with the results in the C# MSDN article for two of the email addresses. Looking at the original regular expression listed on http://emailregex.com suggests that the MSDN article got it wrong, unless the specification has recently changed. I decided to go with the regular expression from http://emailregex.com.
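
    Both disagreements concern the user-name part of the address. `*` is among the characters that RFC 5322 lists as `atext`, so `js*` is acceptable as an unquoted user name. The quoted user name is a different case: the adapted expression above keeps only the unquoted, dot-separated alternative, so it rejects any quoted user name regardless of what the original expression allows. The following sketch checks the unquoted form by hand, without any regular expression engine; the helper names `isAtext` and `isDotAtom` are hypothetical.

    ```cpp
    #include <cctype>
    #include <string>

    // True if c may appear in an unquoted user name ("atext" in RFC 5322).
    bool isAtext(char c)
    {
        static const std::string specials = "!#$%&'*+-/=?^_`{|}~";
        return std::isalnum(static_cast<unsigned char>(c)) != 0
            || specials.find(c) != std::string::npos;
    }

    // True if userName is one or more non-empty atext runs separated by single dots.
    bool isDotAtom(const std::string& userName)
    {
        bool previousWasDot = true;   // treat the start like a dot, so a leading '.' fails
        for (char c : userName) {
            if (c == '.') {
                if (previousWasDot)
                    return false;     // leading dot or two dots in a row
                previousWasDot = true;
            } else if (isAtext(c)) {
                previousWasDot = false;
            } else {
                return false;         // for example '"', '\\', a space or '@'
            }
        }
        return !previousWasDot;       // rejects empty input and a trailing dot
    }
    ```

    With these helpers, `isDotAtom("js*")` holds, while the quoted user name from the MSDN list fails because `"` and `\` are not `atext` characters.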

    Here are my unit tests, using the same email addresses as the MSDN article:

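    #include <Windows.h>
    #include <crtdbg.h>   // _ASSERTE
    #include <stdlib.h>   // abort
    #ifdef _DEBUG
    #define TESTEXPR(expr) _ASSERTE(expr)
    #else
    // Release builds have no assertions, so stop the program on the first failure.
    #define TESTEXPR(expr) do { if (!(expr)) abort() ; } while (0)
    #endif
    
    int main()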
    {
        LPCWSTR validEmailAddresses[] = {   L"david.jones@proseware.com", 
                                            L"d.j@server1.proseware.com",
                                            L"jones@ms1.proseware.com", 
                                            L"j@proseware.com9", 
                                            L"js#internal@proseware.com",
                                            L"j_9@[129.126.118.1]", 
                                            L"js*@proseware.com",            // <== according to https://msdn.microsoft.com/en-us/library/01escwtf(v=vs.110).aspx this is invalid
                                                                             // but according to http://emailregex.com/ that claims to support the RFC 5322 Official standard it's not. 
                                                                             // I'm going with valid
                                            L"js@proseware.com9", 
                                            L"j.s@server1.proseware.com",
                                            L"js@contoso.中国", 
                                            NULL } ;
    
        LPCWSTR invalidEmailAddresses[] = { L"j.@server1.proseware.com",
                                            L"\"j\\\"s\\\"\"@proseware.com", // <== according to https://msdn.microsoft.com/en-us/library/01escwtf(v=vs.110).aspx this is valid
                                                                             // but according to http://emailregex.com/ that claims to support the RFC 5322 Official standard it's not. 
                                                                             // I'm going with Invalid
                                            L"j..s@proseware.com",
                                            L"js@proseware..com",
                                            NULL } ;
    
        for (LPCWSTR* emailAddress = validEmailAddresses ; *emailAddress != NULL ; ++emailAddress)
        {
            TESTEXPR(LocalLooksLikeEmailAddress(*emailAddress)) ;
        }
        for (LPCWSTR* emailAddress = invalidEmailAddresses ; *emailAddress != NULL ; ++emailAddress)
        {
            TESTEXPR(!LocalLooksLikeEmailAddress(*emailAddress)) ;
        }
    }