I have a string that can contain either an email address or an IRI (internationalized URI). The strings do not contain additional surrounding whitespace or any HTTP linefolding characters. Moreover they do not contain any elements marked as "obsolete" in their corresponding specifications. I need a simple way to distinguish which of these things the string contains.
I'm looking at what I believe to be the latest respective specifications: RFC 5322 § 3.4.1. Addr-Spec Specification for emails, and RFC 3987 § 2.2. ABNF for IRI References and IRIs for IRIs. I've come up with the following algorithm, with explanations in parentheses:
"
character, it is an email address. (Email address local-part
may be a quoted string, but an IRI scheme
may not.)@
sign or colon :
character.
@
sign, the string contains an email address.:
character, the string contains an IRI.Is that approach correct? Is there another simpler approach? Lastly for bonus, how would I expand this algorithm to also distinguish those two things from an IP address (including both IPv4 and IPv6)?
I would think the rules as specified are correct and fast to determine the type (email or IRI). To extend this to IP addresses their corresponding grammar should be added: https://datatracker.ietf.org/doc/html/draft-main-ipaddr-text-rep-00.
So then your rules could be extended to:
Rules: (I assumed well formed input)
"
=> email:
=> IpV6 (because an IRI the scheme has to contain at least one char):
or @
@
=> email
:
=>
If it does not match the grammar for IpV6 => IRI
Otherwise: ambiguous, also in the grammar, some options
Use as IpV6 => it will be valid, likely to be the thing intended
Use it as IRI => the first part (before the ':') will be a scheme the later part will be one 'segment' in the protocol
So ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff
will lead to scheme ffff
and 'segment' ffff:ffff:ffff:ffff:ffff:ffff:ffff
I would find this situation very unlikely
Raise an exception, depending on the environment this could be a valid option
Both not in the string => IpV4
ipchar := hex / ':'
hex := [0-9A-Fa-f]