c# regex regex-group regex-greedy

Optimize regex to read date


I have developed a regex for a .NET Web API that extracts dates and a control code from an input string that is already in its final format.

I tried regex to avoid using multiple string splits.

I've been using Regex101 to test my expression, and I have one that already works as expected, but I think it's longer than it needs to be for what it does.

Expression:

^([0-9]{2})+([0-9]{2})+([0-9]{2})[0-9](M|F)([0-9]{2})+([0-9]{2})+([0-9]{2})

// Get in format Year, Month, Day, Code(M|F), Year, Month, Day

Input:

7603259M2209058PRT<<<<<<<<<<<8

Do you have any suggestions to simplify it?


Solution

  • There is one issue with your regex: you quantified the two-digit matching capturing groups with a + quantifier, making them match one or more times. ([0-9]{2})+ matches one or more sequences of any two ASCII digits, while keeping the last captured value in the corresponding group. See Repeating a Capturing Group vs. Capturing a Repeated Group.
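    To illustrate the point about repeated capturing groups, here is a minimal sketch. The class and helper names (RepeatedGroupDemo, LastValue, AllCaptures) and the six-digit test string are made up for this illustration; only the ([0-9]{2})+ construct comes from the question.

    ```csharp
    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static class RepeatedGroupDemo
    {
        // Hypothetical helper: final value of group 1 after ([0-9]{2})+ has repeated.
        public static string LastValue(string input) =>
            Regex.Match(input, @"^([0-9]{2})+$").Groups[1].Value;

        // Hypothetical helper: every capture recorded for group 1, in order.
        public static string[] AllCaptures(string input) =>
            Regex.Match(input, @"^([0-9]{2})+$").Groups[1].Captures
                 .Cast<Capture>().Select(c => c.Value).ToArray();

        public static void Main()
        {
            // The group matched three times, but Groups[1].Value holds only the last pair.
            Console.WriteLine(LastValue("760325"));                     // => 25
            Console.WriteLine(string.Join(",", AllCaptures("760325"))); // => 76,03,25
        }
    }
    ```

    In the original pattern, each quantified group can therefore consume more than one pair of digits, and the regex only lands on the intended split after backtracking.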

    Remove all + quantifiers from your pattern. You can then also make the following changes:

    • Use \d to match any digit, and pass RegexOptions.ECMAScript to the Regex constructor so that \d matches ASCII digits only (otherwise, \d is equivalent to \p{Nd} and matches any Unicode decimal digit, see \d less efficient than [0-9])
    • Instead of alternation between single chars ((M|F)), use a character class, ([MF]); this is more efficient (see Why is a character class faster than alternation?).
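    The difference between the default \d and \d under RegexOptions.ECMAScript can be checked with a short sketch. The class and helper names (DigitClassDemo, MatchesDefault, MatchesEcma) are hypothetical, and the test string uses two Arabic-Indic digits chosen only as an example of non-ASCII decimal digits.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class DigitClassDemo
    {
        // Two ARABIC-INDIC digits (U+0661, U+0662): Unicode category Nd, but not ASCII.
        public const string ArabicIndic = "\u0661\u0662";

        // Default options: \d behaves like \p{Nd}.
        public static bool MatchesDefault(string s) =>
            Regex.IsMatch(s, @"^\d{2}$");

        // ECMAScript option: \d behaves like [0-9].
        public static bool MatchesEcma(string s) =>
            Regex.IsMatch(s, @"^\d{2}$", RegexOptions.ECMAScript);

        public static void Main()
        {
            Console.WriteLine(MatchesDefault(ArabicIndic)); // => True
            Console.WriteLine(MatchesEcma(ArabicIndic));    // => False
            Console.WriteLine(MatchesEcma("76"));           // => True
        }
    }
    ```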

    You can use

    var pattern = new Regex(@"^(\d{2})(\d{2})(\d{2})\d([MF])(\d{2})(\d{2})(\d{2})", RegexOptions.ECMAScript);
    

    See the .NET regex demo.
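
    As a rough usage sketch, the pattern above can be applied to the sample input from the question like this. The class name DateFieldsDemo and the Extract helper are hypothetical and exist only for this illustration.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class DateFieldsDemo
    {
        private static readonly Regex Pattern = new Regex(
            @"^(\d{2})(\d{2})(\d{2})\d([MF])(\d{2})(\d{2})(\d{2})",
            RegexOptions.ECMAScript);

        // Hypothetical helper: returns the seven captured fields
        // (year, month, day, code, year, month, day), or null if there is no match.
        public static string[] Extract(string input)
        {
            Match m = Pattern.Match(input);
            if (!m.Success) return null;

            var fields = new string[7];
            for (int i = 0; i < 7; i++)
                fields[i] = m.Groups[i + 1].Value; // Groups[0] is the whole match
            return fields;
        }

        public static void Main()
        {
            string[] f = Extract("7603259M2209058PRT<<<<<<<<<<<8");
            Console.WriteLine(string.Join(" ", f)); // => 76 03 25 M 22 09 05
        }
    }
    ```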

    If you want to use an even shorter regex, you may use:

    var pattern = new Regex(@"^(?:(\d{2})){3}\d([MF])(?:(\d{2})){3}", RegexOptions.ECMAScript);
    var match = pattern.Match("7603259M2209058PRT<<<<<<<<<<<8");
    if (match.Success)
    {
        Console.WriteLine(match.Groups[1].Captures[0].Value); // => 76
        Console.WriteLine(match.Groups[1].Captures[1].Value); // => 03
        Console.WriteLine(match.Groups[1].Captures[2].Value); // => 25
        Console.WriteLine(match.Groups[2].Value);             // => M
        Console.WriteLine(match.Groups[3].Captures[0].Value); // => 22
        Console.WriteLine(match.Groups[3].Captures[1].Value); // => 09
        Console.WriteLine(match.Groups[3].Captures[2].Value); // => 05
    }
    

    See the C# demo and this regex demo.

    Note that this is possible because the .NET regex engine keeps every capture of a repeated group and exposes them through the group's Captures collection.