I am struggling to find a regex that matches a URN as described in RFC 8141. I have tried this one:
\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):(?<nss>(?:[a-z0-9()+,-.:=@;$_!*']|%[0-9a-f]{2})+))\z
but it only matches the first part of the URN, without the optional components.
For example, let's say we have the following URN: urn:example:a123,0%7C00~&z456/789?+abc?=xyz#12/3
We should match the following groups:
I haven't read the whole specification, so there may be other rules to implement, but this should put you on the right track for the optional components:
\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):(?<nss>(?:[-a-z0-9()+,.:=@;$_!*'&~\/]|%[0-9a-f]{2})+)(?:\?\+(?<rcomponent>.*?))?(?:\?=(?<qcomponent>.*?))?(?:#(?<fcomponent>.*?))?)\z
Explanations:

(?<nss>(?:[-a-z0-9()+,.:=@;$_!*'&~\/]|%[0-9a-f]{2})+): The - has been moved to the beginning of the character class so that it is read as a literal character; in its original position, ,-. means "range from , to .". The characters &, ~ and / (which has to be escaped with "\" when / is the regex delimiter) have also been added to the class, or else it won't match your example.

(?:\?\+(?<rcomponent>.*?))?: Each component sits inside an optional non-capturing group (?:...)? to prevent capturing its identifier (the ?+, ?= or # part). The characters ? and + have to be escaped with "\". The named group captures anything (.) but in lazy mode (*?), or else the first component found would capture everything until the end of the string.

See the working example on Regex101.
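To check the pattern outside Regex101, here is a minimal sketch in Python. It is an adaptation, not the exact regex above: Python's re module writes named groups as (?P<name>...) and uses \Z rather than \z for the end of the string, and / needs no escaping. The names URN_PATTERN and parse_urn are made up for this illustration.

```python
import re

# Python adaptation of the regex above (assumption: same logic, Python syntax).
# Named groups use (?P<name>...), and \Z replaces \z as the end-of-string anchor.
URN_PATTERN = re.compile(
    r"\A(?i:urn:(?!urn:)"
    r"(?P<nid>[a-z0-9][a-z0-9-]{1,31}):"
    r"(?P<nss>(?:[-a-z0-9()+,.:=@;$_!*'&~/]|%[0-9a-f]{2})+)"
    r"(?:\?\+(?P<rcomponent>.*?))?"   # optional r-component, introduced by ?+
    r"(?:\?=(?P<qcomponent>.*?))?"    # optional q-component, introduced by ?=
    r"(?:#(?P<fcomponent>.*?))?"      # optional f-component, introduced by #
    r")\Z"
)


def parse_urn(text):
    """Return a dict of the named groups, or None if text does not match."""
    match = URN_PATTERN.match(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    example = "urn:example:a123,0%7C00~&z456/789?+abc?=xyz#12/3"
    for name, value in parse_urn(example).items():
        print(f"{name}: {value}")
```

For the example URN from the question, this yields example for nid, a123,0%7C00~&z456/789 for nss, and abc, xyz and 12/3 for the three components. Components that are absent come back as None.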
Hope that helps