Parsing a log string without Split


I searched around a bit, but I only found cases where splitting on commas or similar would work. This case is different.

To explain my problem I'll show a tiny example:

JAN 01 00:00:01 <Admin> Action, May have spaces etc.

(This is a log entry)

I'd like to parse this string into several variables. The first part is obviously a date, without a year. The login name is listed between the <>'s, and the rest of the log entry follows it.

The configuration should have something like this:

{month} {day} {hour}:{minute}:{second} <{login}> {the_rest}

This will allow changes without having the whole thing hardcoded (using splits etc).

I think Regex may be useful here, but I don't know much about it or whether it is usable in this case at all. Speed does not matter much; I just don't know how to achieve this.

Thanks,

~Tgys


Solution

  • Regular expressions are indeed the correct tool here. First, let's see how you can use a hardcoded regular expression to parse this log.

    Parsing with a hardcoded regular expression

    var str = "JAN 01 00:00:01 <Admin> Action, May have spaces etc.";
    var re = new Regex("^" +
           @"(?<month>(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC))" +
           " " +
           @"(?<day>\d+)" +
           " " +
           @"(?<hour>\d+)" +
           ":" +
           @"(?<the_rest>.*)" +
           "$");
    var match = re.Match(str);
    

    What we did here is create a regular expression piece by piece using named capturing groups. For brevity, I didn't capture all the relevant information, and I didn't spend much time considering what is valid input for each group (e.g. day will match 999, although that's not a valid day). All this can come later.
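
    To make the result of this first step concrete, here is a minimal sketch of reading the named groups back out of the match. The class name HardcodedExample is made up for illustration, and the pattern is the same one as above written as a single string (without the redundant inner parentheses around the month alternatives):

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class HardcodedExample
    {
        // Same expression as above, written as one verbatim string.
        public static readonly Regex Re = new Regex(
            @"^(?<month>JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) (?<day>\d+) (?<hour>\d+):(?<the_rest>.*)$");

        public static void Main()
        {
            var str = "JAN 01 00:00:01 <Admin> Action, May have spaces etc.";
            var match = Re.Match(str);

            if (match.Success)
            {
                // Each named group is available through the Groups indexer.
                Console.WriteLine(match.Groups["month"].Value);    // JAN
                Console.WriteLine(match.Groups["day"].Value);      // 01
                Console.WriteLine(match.Groups["hour"].Value);     // 00
                Console.WriteLine(match.Groups["the_rest"].Value); // 00:01 <Admin> Action, May have spaces etc.
            }
        }
    }
    ```

    Note that the_rest swallows everything after the first colon, because minutes, seconds and the login are not captured separately yet.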

    Constructing the regular expression from predefined pieces

    The next step is to nicely pull out the definition of each capturing group into a dictionary:

    var groups = new Dictionary<string, string>
    {
        { "month", "(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)" },
        { "day", @"\d+" },
        { "hour", @"\d+" },
        { "the_rest", ".*" },
    };
    

    Given this, we can now construct the same regex with:

    var re = new Regex("^" +
           string.Format("(?<month>{0})", groups["month"]) +
           " " +
           string.Format("(?<day>{0})", groups["day"]) +
           " " +
           string.Format("(?<hour>{0})", groups["hour"]) +
           ":" +
           string.Format("(?<the_rest>{0})", groups["the_rest"]) +
           "$");
    

    OK, this is starting to look like something that can be constructed dynamically.

    Constructing the regular expression based on user-supplied specification

    Let's say we want to construct it from a specification that looks like

    "{month} {day} {hour}:{the_rest}"
    

    How to do this? With another regular expression! Specifically, we will use the overload of Regex.Replace that enables replacement of a match with the result of a function:

    var format = "{month} {day} {hour}:{the_rest}";
    var result = Regex.Replace(format, @"\{(\w+)\}", m => groups[m.Groups[1].Value]);
    

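    For the format above, result is the pattern (JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) \d+ \d+:.* which is the same expression as before, minus the named groups and the ^ and $ anchors.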
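
    One caveat with substituting directly into the format: literal text between placeholders is treated as regex syntax, so a "." in the format would match any character, and an unknown placeholder name would throw a KeyNotFoundException from the dictionary lookup. The following is a hedged sketch of one way to harden this step. It is not the approach used in the rest of this answer: it walks the placeholder matches manually and escapes the literal text with Regex.Escape. The names FormatCompiler and ToPattern are hypothetical.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;

    public static class FormatCompiler
    {
        // Hypothetical helper: converts a format such as "{day}.{hour}" into an
        // anchored regex pattern, escaping everything outside the {name} placeholders.
        public static string ToPattern(string format, IDictionary<string, string> groups)
        {
            var pattern = new StringBuilder("^");
            var position = 0;

            foreach (Match m in Regex.Matches(format, @"\{(\w+)\}"))
            {
                // Literal text before this placeholder must match itself.
                pattern.Append(Regex.Escape(format.Substring(position, m.Index - position)));

                var name = m.Groups[1].Value;
                string piece;
                if (!groups.TryGetValue(name, out piece))
                    throw new ArgumentException("Unknown placeholder: {" + name + "}");

                // Wrap the predefined piece in a named group.
                pattern.Append("(?<" + name + ">" + piece + ")");
                position = m.Index + m.Length;
            }

            // Trailing literal text after the last placeholder.
            pattern.Append(Regex.Escape(format.Substring(position)));
            pattern.Append("$");
            return pattern.ToString();
        }

        public static void Main()
        {
            var groups = new Dictionary<string, string>
            {
                { "day", @"\d+" },
                { "hour", @"\d+" },
            };
            var regex = new Regex(ToPattern("{day}.{hour}", groups), RegexOptions.ExplicitCapture);

            Console.WriteLine(regex.IsMatch("01.00")); // True
            Console.WriteLine(regex.IsMatch("01x00")); // False: the "." is literal
        }
    }
    ```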

    Using the regular expression to parse the input

    At this point, we can pass in a format specification and get back a regular expression that matches the input based on this format. What's left? Turning the results of matching the regular expression against the input back into a "dynamic" structure. To do that, we wrap each piece in a named group as we substitute it:

    var format = "{month} {day} {hour}:{the_rest}";
    var re = Regex.Replace(format,
                           @"\{(\w+)\}",
                           m => string.Format("(?<{0}>{1})", m.Groups[1].Value, groups[m.Groups[1].Value]));
    var regex = new Regex("^" + re + "$", RegexOptions.ExplicitCapture);
    var match = regex.Match(str);
    

    Pulling the final results out

    At this point:

    • we can test match.Success to see if the dynamically constructed expression matches the input
    • we can iterate over regex.GetGroupNames() to get the names of the groups used in parsing
    • we can iterate over match.Groups to get the results of parsing each group

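    So let's put them in a dictionary (note that GetGroupNames() also returns "0", the group for the entire match, so the dictionary will contain that entry as well):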

    var results = regex.GetGroupNames().ToDictionary(n => n, n => match.Groups[n].Value);
    

    Success!

    You can now create a method Parse that allows this:

    var input = "JAN 01 00:00:01 <Admin> Action, May have spaces etc.";
    var format = "{month} {day} {hour}:{the_rest}";
    var results = Parse(input, format);
    

    Parse will recognize (but not allow the user to modify) expressions such as "{month}", while at the same time allowing the user to mix and match these expressions freely in order to parse the input.
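
    A minimal sketch of what such a Parse method might look like, assembled from the steps above. The class name LogParser is made up, the patterns for "minute", "second" and "login" are assumptions added so that the full format from the question can be used, and returning null on a failed match is just one possible design choice:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static class LogParser
    {
        // Predefined pieces. "month", "day", "hour" and "the_rest" come from the
        // steps above; "minute", "second" and "login" are assumed for illustration.
        private static readonly Dictionary<string, string> Groups = new Dictionary<string, string>
        {
            { "month", "(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)" },
            { "day", @"\d+" },
            { "hour", @"\d+" },
            { "minute", @"\d+" },  // assumption
            { "second", @"\d+" },  // assumption
            { "login", "[^>]+" },  // assumption: anything up to the closing '>'
            { "the_rest", ".*" },
        };

        // Returns a name -> value dictionary, or null if the input does not match.
        public static Dictionary<string, string> Parse(string input, string format)
        {
            // Replace each {name} with a named group wrapping its predefined piece.
            var pattern = Regex.Replace(format, @"\{(\w+)\}",
                m => string.Format("(?<{0}>{1})", m.Groups[1].Value, Groups[m.Groups[1].Value]));

            var regex = new Regex("^" + pattern + "$", RegexOptions.ExplicitCapture);
            var match = regex.Match(input);
            if (!match.Success)
                return null;

            return regex.GetGroupNames()
                        .Where(n => n != "0")  // "0" is the whole match, not a placeholder
                        .ToDictionary(n => n, n => match.Groups[n].Value);
        }

        public static void Main()
        {
            var input = "JAN 01 00:00:01 <Admin> Action, May have spaces etc.";
            var format = "{month} {day} {hour}:{minute}:{second} <{login}> {the_rest}";

            foreach (var pair in Parse(input, format))
                Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
    ```

    With this format, "login" comes back as Admin and "the_rest" as the text after the closing bracket. A placeholder that is not in the dictionary still throws a KeyNotFoundException here, so real code would probably want to validate the format first.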
