c# .net regex urn

Regex that matches a URN per RFC 8141


I am struggling to find a regex that matches a URN as described in RFC 8141. I have tried this one:

\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):(?<nss>(?:[a-z0-9()+,-.:=@;$_!*']|%[0-9a-f]{2})+))\z

but it only matches the first part of the URN, without the optional components.

For example, let's say we have the following URN: urn:example:a123,0%7C00~&z456/789?+abc?=xyz#12/3. The regex should capture the following groups:

  • NID - example
  • NSS - a123,0%7C00~&z456/789 (from the ':' after the NID until we reach '?+', '?=' or '#')
  • r-component - abc (from '?+' until '?=' or '#')
  • f-component - 12/3 (from '#' till end)

Solution

  • I haven't read the whole specification, so there may be other rules to implement, but this should get you started on the optional components:

    \A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):(?<nss>(?:[-a-z0-9()+,.:=@;$_!*'&~\/]|%[0-9a-f]{2})+)(?:\?\+(?<rcomponent>.*?))?(?:\?=(?<qcomponent>.*?))?(?:#(?<fcomponent>.*?))?)\z
    

    Explanations:

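    • (?<nss>(?:[-a-z0-9()+,.:=@;$_!*'&~\/]|%[0-9a-f]{2})+) : The - has been moved to the start of the character class so that it is unambiguously a literal hyphen. In your version, ,-. is read as a range from , to . (in ASCII that range happens to contain only , - and . so the hyphen still matched, but only by accident). The characters &, ~ and / have also been added to the class; without them the pattern cannot match your example. The / is written as \/ because it must be escaped wherever / delimits the pattern (as on Regex101); in a .NET string the escape is harmless but optional.
    • Optional components, e.g. (?:\?\+(?<rcomponent>.*?))? : each one sits inside an optional non-capturing group (?:...)? so that the delimiter (the ?+, ?= or # part) is matched but not captured. The chars ? and + have to be escaped with "\". The inner group captures anything (.) but lazily (*?); with a greedy .* the first component found would swallow everything up to the end of the string.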
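
    To check the pattern from C#, here is a minimal sketch that runs it against the URN from the question and prints each named group. The class name UrnDemo and the helpers Parse and Dump are hypothetical names made up for this example; only Regex, Match and Group come from System.Text.RegularExpressions. The pattern itself is unchanged, just split across several verbatim strings for readability.

```csharp
using System;
using System.Text.RegularExpressions;

static class UrnDemo
{
    // Pattern from the answer above, unchanged. Verbatim strings (@"...")
    // keep the backslashes literal, so no double escaping is needed.
    static readonly Regex UrnPattern = new Regex(
        @"\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):" +
        @"(?<nss>(?:[-a-z0-9()+,.:=@;$_!*'&~\/]|%[0-9a-f]{2})+)" +
        @"(?:\?\+(?<rcomponent>.*?))?" +
        @"(?:\?=(?<qcomponent>.*?))?" +
        @"(?:#(?<fcomponent>.*?))?)\z");

    // Hypothetical helper: returns the Match so callers can inspect the groups.
    public static Match Parse(string urn) => UrnPattern.Match(urn);

    // Hypothetical helper: prints every named group, or "(absent)" when an
    // optional group did not take part in the match.
    static void Dump(string urn)
    {
        Match m = Parse(urn);
        Console.WriteLine($"{urn} -> match: {m.Success}");
        foreach (string name in new[] { "nid", "nss", "rcomponent", "qcomponent", "fcomponent" })
        {
            Group g = m.Groups[name];
            string value = g.Success ? g.Value : "(absent)";
            Console.WriteLine($"  {name}: {value}");
        }
    }

    static void Main()
    {
        // Expected groups: nid=example, nss=a123,0%7C00~&z456/789,
        // rcomponent=abc, qcomponent=xyz, fcomponent=12/3
        Dump("urn:example:a123,0%7C00~&z456/789?+abc?=xyz#12/3");

        // No optional components: the three component groups are absent.
        Dump("urn:isbn:0451450523");
    }
}
```

    Checking Group.Success rather than comparing Value to an empty string lets you tell a component that is missing apart from one that is present but empty (for example a URN ending in "#").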

    See working example in Regex101
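
    Regarding the "other rules" caveat above: the lazy .*? accepts any character inside a component, including spaces, and the NID part accepts a trailing hyphen. The sketch below is one possible tightening, based on my reading of the ABNF in RFC 8141 section 2 (NID must end in an alphanumeric; r-, q- and f-components are built from pchar plus "/" and "?"). Treat it as an assumption-laden starting point rather than a validated implementation. StrictUrn, PChar, Tail and IsUrn are hypothetical names introduced for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class StrictUrn
{
    // pchar = unreserved / pct-encoded / sub-delims / ":" / "@"  (RFC 3986)
    const string PChar = @"(?:[a-z0-9\-._~!$&'()*+,;=:@]|%[0-9a-f]{2})";

    // Characters allowed after the first one in r-, q- and f-components:
    // pchar / "/" / "?"
    const string Tail = "(?:" + PChar + "|[/?])";

    static readonly Regex Pattern = new Regex(
        // NID: 2 to 32 characters, alphanumeric at both ends.
        @"\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{0,30}[a-z0-9]):" +
        // NSS: one pchar, then any mix of pchar and "/".
        "(?<nss>" + PChar + "(?:" + PChar + "|/)*)" +
        // r- and q-components stay lazy so that the first "?=" or "#"
        // ends them, even though "?" and "=" are themselves allowed characters.
        @"(?:\?\+(?<rcomponent>" + PChar + Tail + "*?))?" +
        @"(?:\?=(?<qcomponent>" + PChar + Tail + "*?))?" +
        // f-component may be empty and runs to the end of the string.
        "(?:#(?<fcomponent>" + Tail + "*))?" +
        @")\z");

    public static Match Parse(string urn) => Pattern.Match(urn);

    public static bool IsUrn(string urn) => Pattern.IsMatch(urn);

    static void Main()
    {
        // The example from the question still matches.
        Console.WriteLine(IsUrn("urn:example:a123,0%7C00~&z456/789?+abc?=xyz#12/3")); // True

        // Rejected here, accepted by the looser pattern:
        Console.WriteLine(IsUrn("urn:ab-:x"));             // False: NID ends in "-"
        Console.WriteLine(IsUrn("urn:example:abc?+r c"));  // False: space in r-component
    }
}
```

    The components are still split at the first ?= because the quantifiers remain lazy; for urn:example:a?+r?=q1?=q2 this yields an r-component of r and a q-component of q1?=q2.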

    Hope that helps