regex, regex-lookarounds, regex-group

Proper regex for text in a post tagging system


I'm creating what's basically a SO clone as practice, and I'm trying to implement a tagging system, though I'm having some trouble with the regex for the tag names.

I'm trying to achieve the same result that StackOverflow has with its tags, that is:

  • Any combination of alphanumerical characters, case insensitive
  • 0 or 1 of ., - or _ followed by more alphanumericals
  • A maximum of 3 periods, dashes or underscores allowed in 1 tag

These should return a positive match:

exampletag
example-tag
ex-ample-tag
ex_ample_tag
ex-ample_tag
ex.am-ple_tag
Ex.4m-p1e_t4g

And these should not match. For the sake of the question, assume that whitespace marks the start of a new tag and can safely be ignored at this point:

ex-am-pl-et-ag // and variations where there's more than 3 `-` `_` or `.`
-exampletag // no starting symbols
exampletag- // no trailing symbols

I'm currently stuck at this point with the regex, and I'm unsure how to take it further:

((\w+)(\-|\_|\.)?)\1?

My reasoning:

(                    Capture the sequence of #2 and #3 into capture group #1
  (                  Capture group #2
    \w+              One or more alphanumericals
  )
  (                  Capture group #3
    \-|\_|\.         - _ or .
  )?                 0 or 1 of the preceding
)
  \1?                0 or 1 of capture group #1

The \1 part doesn't work quite like how I expected it to work, though. This will match something like example-, but the tag part will be a secondary hit, and I'm stuck on how to proceed from here.

Preferably I'd want this regex to work with the Ruby flavor of regex, but any flavor is fine.
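
For reference, the behaviour described above can be reproduced with a quick Ruby check. This is only a minimal sketch: ATTEMPT and MATCHES are made-up names for illustration, and the input is one of the example tags from the list above.

```ruby
# ATTEMPT is a made-up name for the pattern from the question.
ATTEMPT = /((\w+)(\-|\_|\.)?)\1?/

# String#scan returns one array of capture groups per match.
MATCHES = "example-tag".scan(ATTEMPT)

# Each inner array is [group 1, group 2, group 3] for one match.
MATCHES.each { |groups| p groups }
```

The tag comes back as two separate matches, "example-" and then "tag", which is the "secondary hit" mentioned above.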


Solution

  • Mind that \w matches letters, digits and also underscores, so a check on the number of underscores can never be accurate while \w is in the pattern. Besides, your pattern simply matches a sequence of one or more word chars followed by an optional -, _ or ., and then \1? tries to optionally match the same text that was captured into Group 1 immediately to the right of the current location.

    I suggest replacing every \w with [^\W_] to exclude (subtract) _ from \w, using a construct like a(?:ba){0,3} to match separator-delimited items, and adding anchors, ^ and $ at least, to match the start and end of the string.

    You can use

    ^[^\W_]+(?:[-_.][^\W_]+){0,3}$
    

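    In Ruby, ^ and $ match at the start and end of every line, so to anchor at the start and end of the whole string it must be written as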

    \A[^\W_]+(?:[-_.][^\W_]+){0,3}\z
    

    Details

    • \A - start of string
    • [^\W_]+ - one or more word chars except _
    • (?:[-_.][^\W_]+){0,3} - zero, one, two or three occurrences of a -/_/. and then one or more word chars other than _
    • \z - end of string.

    See the regex demo.
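
As a rough check of the Ruby pattern against the examples from the question, something like the following can be run. It is a sketch only: TAG_PATTERN, valid_tag?, VALID, INVALID, LINE_ANCHORED and MULTILINE are names invented here for illustration, and the multi-line input is a made-up case added to show why \A and \z are used instead of ^ and $.

```ruby
# TAG_PATTERN and valid_tag? are illustrative names, not part of the original answer.
TAG_PATTERN = /\A[^\W_]+(?:[-_.][^\W_]+){0,3}\z/

def valid_tag?(name)
  TAG_PATTERN.match?(name)
end

# Example tags taken from the question.
VALID   = %w[exampletag example-tag ex-ample-tag ex_ample_tag ex-ample_tag ex.am-ple_tag Ex.4m-p1e_t4g]
INVALID = %w[ex-am-pl-et-ag -exampletag exampletag-]

VALID.each   { |tag| puts "#{tag}: #{valid_tag?(tag)}" }
INVALID.each { |tag| puts "#{tag}: #{valid_tag?(tag)}" }

# \w accepts the underscore, while [^\W_] does not.
puts "_".match?(/\A\w\z/)       # true
puts "_".match?(/\A[^\W_]\z/)   # false

# Same pattern with ^ and $ (assumed variant for comparison): in Ruby these
# anchor per line, so a multi-line string passes if any single line is a tag.
LINE_ANCHORED = /^[^\W_]+(?:[-_.][^\W_]+){0,3}$/
MULTILINE = "exampletag\n-bad"
puts LINE_ANCHORED.match?(MULTILINE)   # true
puts valid_tag?(MULTILINE)             # false
```

Every entry in VALID should print true and every entry in INVALID should print false. The last two lines show the practical difference between the two anchor styles when the input can contain newlines.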