c# regex string-matching

Use regex to match a string under given conditions


[Edit] Note:

The main question is how to write the shortest regex; it is not about back-references.


Requirement:

Use the shortest regex to match all strings in the following format:

<two digits><connect char><three digits><connect char><four digits>

For easier reading:

<two digits>
<connect char>
<three digits>
<connect char>
<four digits>

Conditions:

  • Match the whole string; the input string is a single line.
  • The connect chars may both be omitted, or each may be any of [-./ ] (the brackets themselves are not included).
  • The two connect chars must be the same within each matched string.
  • Shortness is what matters; performance is not important.

Example

Some valid strings:

55.635.8828
72/683/1582
86 942 7682
581827998      // Both connect chars are omitted

Some invalid strings:

56.855/9856     // Two connect chars are different.
56 4559428      // Same as above

This short regex matches all valid strings:

^\d{2}[-./ ]?\d{3}[-./ ]?\d{4}$

But it also matches invalid ones:

52-355/9984

This regex matches only the correct strings, but it is quite long. I broke it across multiple lines for easier reading:

^(?:(\d{2}-?\d{3}-?\d{4})|
(\d{2}\.?\d{3}\.?\d{4})|
(\d{2}/?\d{3}/?\d{4})|
(\d{2} ?\d{3} ?\d{4}))$

Can you suggest a shorter regex that meets the requirement?


Solution

  • You can capture the separator and use a backreference instead of repeating the pattern:

    ^\d\d([-./ ]?)\d{3}\1\d{4}$
         ^       ^     ^^
    


    In C#:

    var isValid = Regex.IsMatch(s, @"^\d\d([-./ ]?)\d{3}\1\d{4}$");
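
    To check the one-liner above against the examples from the question, it can be wrapped in a small helper. This is only a sketch: the class name SeparatorPattern and the method name IsValid are made up for illustration and are not part of any library.

    ```csharp
    using System.Text.RegularExpressions;

    // Hypothetical helper class; only the pattern itself comes from the answer.
    public static class SeparatorPattern
    {
        // Group 1 captures the optional separator (possibly an empty string);
        // \1 then requires exactly the same text at the second position.
        private static readonly Regex Pattern =
            new Regex(@"^\d\d([-./ ]?)\d{3}\1\d{4}$");

        public static bool IsValid(string s) => Pattern.IsMatch(s);
    }
    ```

    With this helper, SeparatorPattern.IsValid("55.635.8828") and SeparatorPattern.IsValid("581827998") return true, while SeparatorPattern.IsValid("56.855/9856"), SeparatorPattern.IsValid("56 4559428") and SeparatorPattern.IsValid("52-355/9984") return false.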
    

    Pass the RegexOptions.ECMAScript option to the regex if you want \d to match only ASCII digits; by default, \d in .NET regex matches any Unicode decimal digit.
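
    The difference can be seen with a string written in Arabic-Indic digits. The sketch below assumes two hypothetical helper methods, IsValidUnicode and IsValidAscii, that differ only in the option passed to Regex.IsMatch.

    ```csharp
    using System.Text.RegularExpressions;

    // Hypothetical helper class for illustration only.
    public static class DigitMatching
    {
        private const string Pattern = @"^\d\d([-./ ]?)\d{3}\1\d{4}$";

        // Default options: \d matches any Unicode decimal digit.
        public static bool IsValidUnicode(string s) =>
            Regex.IsMatch(s, Pattern);

        // RegexOptions.ECMAScript: \d is restricted to the ASCII digits 0-9.
        public static bool IsValidAscii(string s) =>
            Regex.IsMatch(s, Pattern, RegexOptions.ECMAScript);
    }
    ```

    For the input "\u0665\u0665.\u0666\u0663\u0665.\u0668\u0668\u0662\u0668" (the digits 55.635.8828 in Arabic-Indic form), IsValidUnicode returns true and IsValidAscii returns false; for the plain ASCII "55.635.8828" both return true.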

    Pattern details

    • ^ - start of string
    • \d\d - any 2 digits
    • ([-./ ]?) - Group 1, capturing one or zero occurrences of -, ., / or a space
    • \d{3} - any 3 digits
    • \1 - the same value as captured in Group 1
    • \d{4} - any 4 digits
    • $ - end of string. Note that $ also matches just before a final newline; use \z to require the exact end of the string, although this is not necessary in the majority of cases.
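
    The choice between $ and \z only matters when the input may end with a line feed. A minimal sketch, using a hypothetical EndAnchors class whose two methods differ only in the final anchor:

    ```csharp
    using System.Text.RegularExpressions;

    // Hypothetical helper class for illustration only.
    public static class EndAnchors
    {
        // $ matches at the very end of the input or just before a final "\n".
        public static bool MatchesWithDollar(string s) =>
            Regex.IsMatch(s, @"^\d\d([-./ ]?)\d{3}\1\d{4}$");

        // \z matches only at the very end of the input.
        public static bool MatchesWithAbsoluteEnd(string s) =>
            Regex.IsMatch(s, @"^\d\d([-./ ]?)\d{3}\1\d{4}\z");
    }
    ```

    For "55.635.8828\n", MatchesWithDollar returns true and MatchesWithAbsoluteEnd returns false; for "55.635.8828" without the trailing line feed, both return true.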