Tags: uri, ip-address, email-address, iri

Distinguish between email address and IRI


I have a string that contains either an email address or an IRI (internationalized URI). The string has no surrounding whitespace and no HTTP line-folding characters. Moreover, it contains no elements marked as "obsolete" in the corresponding specifications. I need a simple way to tell which of the two the string contains.

I'm looking at what I believe to be the latest respective specifications: RFC 5322 § 3.4.1 ("Addr-Spec Specification") for email addresses, and RFC 3987 § 2.2 ("ABNF for IRI References and IRIs") for IRIs. I've come up with the following algorithm, with explanations in parentheses:

  1. If the string begins with a quote " character, it is an email address. (The local-part of an email address may be a quoted string, but an IRI scheme cannot start with a quote.)
  2. Otherwise, find the first at sign @ or colon : character.
    • If the character encountered is an at @ sign, the string contains an email address.
    • Otherwise, if it is a colon : character, the string contains an IRI.
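
As a minimal sketch, the two steps above could look like this in Python. The function name `classify` and the returned labels are illustrative choices, and the input is assumed to be well formed, as described earlier:

```python
def classify(s: str) -> str:
    """Return 'email' or 'iri' for a well-formed input string."""
    # Step 1: only an email address can start with a quoted local-part.
    if s.startswith('"'):
        return "email"
    # Step 2: whichever of '@' or ':' occurs first decides the type.
    for ch in s:
        if ch == "@":
            return "email"
        if ch == ":":
            return "iri"
    # Neither character found: not covered by the algorithm as stated.
    raise ValueError(f"neither '@' nor ':' found in {s!r}")
```

Note that `mailto:user@example.com` is classified as an IRI here, because the colon comes before the at sign.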

Is that approach correct? Is there a simpler approach? Lastly, as a bonus: how would I expand this algorithm to also distinguish those two things from an IP address (both IPv4 and IPv6)?


Solution

  • I think the rules as specified are correct, and they determine the type (email or IRI) quickly. To extend them to IP addresses, the corresponding grammar has to be taken into account as well: https://datatracker.ietf.org/doc/html/draft-main-ipaddr-text-rep-00.

    Your rules could then be extended as follows.

    Rules (assuming well-formed input):

    • First char " => email
    • First char : => IPv6 (because an IRI scheme has to contain at least one character)
    • Otherwise, look at whichever of : or @ occurs first:
      • @ => email

      • : =>

        • If it does not match the grammar for IPv6 => IRI

        • Otherwise the string is ambiguous, because it is valid under both grammars. Some options:

          1. Treat it as IPv6 => it will be valid, and that is likely what was intended

          2. Treat it as an IRI => the part before the first ':' will be the scheme, and the rest will be a single 'segment' of the IRI

            • So ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff leads to scheme ffff and 'segment' ffff:ffff:ffff:ffff:ffff:ffff:ffff

            • I consider this situation very unlikely

          3. Raise an exception; depending on the environment, this could be a valid option

      • Neither character in the string => IPv4

    ipchar := hex / ':'
    hex    := [0-9A-Fa-f]
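
    The extended rules could be sketched roughly as follows in Python. This is only an illustration under a few assumptions: the function name `classify_extended` and the returned labels are made up for the example, the standard-library `ipaddress` module stands in for the IPv6 grammar check, and the ambiguous case is resolved with option 1 (treat it as IPv6).

```python
import ipaddress


def classify_extended(s: str) -> str:
    """Return 'email', 'iri', 'ipv4' or 'ipv6' for a well-formed input string."""
    if s.startswith('"'):
        return "email"  # quoted local-part
    if s.startswith(":"):
        return "ipv6"   # e.g. "::1"; an IRI scheme is never empty
    for ch in s:
        if ch == "@":
            return "email"
        if ch == ":":
            # Assumption: use ipaddress.IPv6Address as the IPv6 grammar check.
            try:
                ipaddress.IPv6Address(s)
            except ipaddress.AddressValueError:
                return "iri"
            return "ipv6"  # ambiguous case, resolved as IPv6 (option 1)
    # Neither '@' nor ':' present; input is assumed to be well formed.
    return "ipv4"


for example in ("user@example.com", "http://example.com/", "::1", "192.0.2.1"):
    print(example, "=>", classify_extended(example))
```

    To implement option 3 instead, the `return "ipv6"` line inside the loop could raise an exception.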