c# regex recursive-regex

Recursive RegEx to match keys and name


I have the strings, ["02-03-2013#3rd Party Fuel", "-1#Archived", "2#06-23-2013#Newswire"], which I want to break down into several parts. These strings are prefixed with date and index keys and contain a name.

I've designed a RegEx that matches each key properly. However, when I try to match the index key, date key, and name in one fell swoop, only the first key is found. It seems the recursive group isn't working as I expect it should.

private const string INDEX_KEY_REGEX = @"(?<index>-?\d+)";
private const string DATE_KEY_REGEX = @"(?<date>(?:0?[1-9]|1[012])-(?:0?[1-9]|[12]\d|3[01])-\d{4})";
private const string KEY_SEARCH_REGEX = @"(?<R>(?:^|(?<=#))({0})#(?(R)))(?<name>.*)";

private string Name = "2#06-23-2013#Newswire";
... = Regex.Replace(
    Name,
    String.Format(KEY_SEARCH_REGEX, INDEX_KEY_REGEX + "|" + DATE_KEY_REGEX),
    "${index}, ${date}, ${name}"
);

// These are the current results for all strings when set into the Name variable.

// Correct Result: ", 02-03-2013, 3rd Party Fuel"
// Correct Result: "-1, , Archived"
// Invalid Result: "2, , 06-23-2013#Newswire"
// Should be: "2, 06-23-2013, Newswire"

Does a keen eye see something I've missed?


Final Solution As I Needed It

It turns out I didn't need a recursive group. I simply needed a zero-or-more repetition of the key group. Here is the full RegEx:

(?:(?:^|(?<=#))(?:(?<index>-?\d+)|(?<date>(?:0?[1-9]|1[012])-(?:0?[1-9]|[12]\d|3[01])-(\d{2}|\d{4})))#)*(?<name>.*)

And the segmented RegEx:

private const string INDEX_REGEX = @"(?<index>-?\d+)";
private const string DATE_REGEX = @"(?<date>(?:0?[1-9]|1[012])-(?:0?[1-9]|[12]\d|3[01])-(\d{2}|\d{4}))";
private const string KEY_WRAPPER_REGEX = @"(?:^|(?<=#))(?:{0})#";
private const string KEY_SEARCH_REGEX = @"(?:{0})*(?<name>.*)";
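
As a minimal sketch of how the segments above fit together, the following composes them with String.Format and reads the named groups. The class name KeyParser and the methods BuildPattern and Describe are hypothetical names introduced for illustration; only the four constants come from the post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeyParser
{
    private const string INDEX_REGEX = @"(?<index>-?\d+)";
    private const string DATE_REGEX = @"(?<date>(?:0?[1-9]|1[012])-(?:0?[1-9]|[12]\d|3[01])-(\d{2}|\d{4}))";
    private const string KEY_WRAPPER_REGEX = @"(?:^|(?<=#))(?:{0})#";
    private const string KEY_SEARCH_REGEX = @"(?:{0})*(?<name>.*)";

    // Hypothetical helper: zero or more "key#" prefixes, then the name.
    public static string BuildPattern()
    {
        string key = String.Format(KEY_WRAPPER_REGEX, INDEX_REGEX + "|" + DATE_REGEX);
        return String.Format(KEY_SEARCH_REGEX, key);
    }

    // Hypothetical helper: formats the captures as "index, date, name".
    // A group that did not participate in the match has an empty Value.
    public static string Describe(string input)
    {
        Match m = Regex.Match(input, BuildPattern());
        return String.Format("{0}, {1}, {2}",
            m.Groups["index"].Value,
            m.Groups["date"].Value,
            m.Groups["name"].Value);
    }
}
```

This sketch uses Regex.Match rather than Regex.Replace. Because the composed pattern can match the empty string, Regex.Replace can report an additional empty match at the end of the input, which would append a second, empty replacement to the result.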

Solution

  • Well, the individual regexes break down like this:

    index: Captures a single positive or negative integer (an optional -, followed by one or more digits).

    date: Captures a date string separated with -. No allowance is made for any other date format. Note that the leading and trailing '#' are not handled; it captures the date, and only the date.

    R: Matches the beginning of the line OR a preceding #, then the key pattern substituted in by String.Format, then a literal #, then a conditional whose true branch is empty and which has no false branch, so it does nothing.

    name: Captures whatever is left.

    Compiled into a single regex, the final result has two top-level captures, R and name:

    R-1: Match either the beginning of the line or a preceding #
    R-2: Match EITHER index or date (but never both)
    R-3: Match #
    R-4: Empty conditional expression
    name: Match whatever is left

    The issue seems to be that R is applied only once, so it never matches both index and date.

    Final edit: working regex

    Bear with me, this thing is nasty. You have to account for all 4 possibilities, or it won't match every possible case. I couldn't figure out any way to generalize it.

    (?:(?<index>-?\d+(?!\d-))#(?<date>(?:0?[1-9]|1[012])-(?:0?[1-9]|[12]\d|3[01])-\d{4})|(?<date>(?:0?[1-9]|1[012])-(?:0?[1-9]|[12]\d|3[01])-\d{4})#(?<index>-?\d+)|(?!-?\d+#)(?<date>(?:0?[1-9]|1[012])-(?:0?[1-9]|[12]\d|3[01])-\d{4})|(?<index>-?\d+)(?!#(?:0?[1-9]|1[012])-(?:0?[1-9]|[12]\d|3[01])-\d{4}))#(?<name>.*)
    

    Ugly, I know. It has 4 initial conditions:

    1a) capture <index>#<date>  OR
    1b) capture <date>#<index>  OR
    1c) capture <index> only, as long as it's not followed by a date  OR
    1d) capture <date> only, as long as it's not preceded by an index
    ...
    2) match but ignore #
    3) capture <name>
    

    Works in all 4 cases.
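
One way to keep the four-alternative regex readable is to assemble it from two fragments so the date expression is written only once. This is a sketch, not the answer's exact code: the class FourCaseKeyParser, the constants Index and Date, and the method Describe are hypothetical names, and the day alternation is written as 3[01], matching the date pattern in the question. It relies on .NET allowing the same group name to appear in several alternatives.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FourCaseKeyParser
{
    // Hypothetical fragment names; the expressions mirror the patterns in the post.
    private const string Index = @"-?\d+";
    private const string Date = @"(?:0?[1-9]|1[012])-(?:0?[1-9]|[12]\d|3[01])-\d{4}";

    private static readonly string Pattern =
        "(?:"
        + "(?<index>" + Index + @"(?!\d-))#(?<date>" + Date + ")"   // index#date
        + "|(?<date>" + Date + ")#(?<index>" + Index + ")"          // date#index
        + "|(?!" + Index + "#)(?<date>" + Date + ")"                // date only
        + "|(?<index>" + Index + ")(?!#" + Date + ")"               // index only
        + ")#(?<name>.*)";                                          // separator, then name

    private static readonly Regex KeyRegex = new Regex(Pattern);

    // Formats the captures as "index, date, name"; a missing key yields an empty field.
    public static string Describe(string input)
    {
        Match m = KeyRegex.Match(input);
        return String.Format("{0}, {1}, {2}",
            m.Groups["index"].Value,
            m.Groups["date"].Value,
            m.Groups["name"].Value);
    }
}
```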

    Final: Final Edit

    There is a way to do this using 3 regexes instead of just 1, which might end up being cleaner.

    //note: index MIGHT be preceded by, and is ALWAYS followed by, a #
    indexRegex = @"((?=#)?(?<!\d|-)-?\d+(?=#))";
    //same with date
    dateRegex = @"((?=#)?(?:0?[1-9]|1[012])-(?:0?[1-9]|[12]\d|3[01])-\d{4}(?=#))";
    //then name
    nameRegex = @"(?:.*#){1,2}(.*)";
    

    Run each of them separately against the input to get the individual values, then rebuild the string.
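
A minimal sketch of the three-regex approach might look like the following. The class ThreePassKeyParser and the method Rebuild are hypothetical names, and it uses Regex.Match instead of a replace to pull out each value. The optional lookahead (?=#)? from the patterns above is left out here, on the assumption that an optional zero-width assertion does not constrain the match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ThreePassKeyParser
{
    // Index: not preceded by a digit or '-', and always followed by '#'.
    private static readonly Regex IndexRegex =
        new Regex(@"(?<!\d|-)-?\d+(?=#)");

    // Date: MM-DD-YYYY, always followed by '#'.
    private static readonly Regex DateRegex =
        new Regex(@"(?:0?[1-9]|1[012])-(?:0?[1-9]|[12]\d|3[01])-\d{4}(?=#)");

    // Name: whatever follows the last '#'-terminated prefix.
    private static readonly Regex NameRegex =
        new Regex(@"(?:.*#){1,2}(.*)");

    // Runs the three patterns independently and rebuilds "index, date, name".
    // A pattern that finds nothing contributes an empty string.
    public static string Rebuild(string input)
    {
        Match index = IndexRegex.Match(input);
        Match date = DateRegex.Match(input);
        Match name = NameRegex.Match(input);
        return String.Format("{0}, {1}, {2}",
            index.Value, date.Value, name.Groups[1].Value);
    }
}
```

Because the name pattern is greedy, a name that itself contains a # would be cut at its last #; the sample strings in the question do not hit that case.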