c# regex linq lexical-analysis

Fetch group name with linq from regex matches


I am trying to build a very simplified lexer using regex and named groups in C#.

I can get all the matched tokens along with their positions just fine, but I cannot find a way to also get the name of the group that matched each one.

I was planning to use that as the token type.

Here is a small example designed to lex simple SQL.

var matches = Regex.Matches("Select * from items where id > '10'", @"
(?:
(?<string>'[^']*')|
(?<number>\d+)|
(?<identifier>[a-zA-Z][a-zA-Z_0-9]+)|
(?:\s+)|
(?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*)|
(?<other>.*)
)+
", RegexOptions.IgnorePatternWhitespace)
.Cast<Match>()
.SelectMany (m => m
    .Groups
    .Cast<Group>()
    .SelectMany (g => g
        .Captures
        .Cast<Capture>()
        .Select (c => new {c.Index, c.Length, c.Value})))
.Skip(1)
.Where (m => m.Length > 0)
.OrderBy (m => m.Index);

This returns a small result like this:

0 6 Select
7 1 *
9 4 from
14 5 items
20 5 where
26 2 id
29 1 >
31 4 '10'

But how can I get the capture group names into the table? Is it possible?

This is not a homework exercise or any type of school work; it's an experiment I am doing for a simple automation API for one of our products.

I can probably rewrite it using a more verbose solution, but I kind of like the "one-liner approach" of this one ;)

And if all else fails, I already have a full lexer using real classes and much more advanced pattern matching, but that is not really required for this :D

UPDATE! I know which groups are available. What I would like to get is, for each capture in the result, which group caught it.

As the first comment points out, there is a method to get all group names from a regex, but then you have to fetch the results group by group. There does not seem to be a way to get the group from the capture.


Solution

  • [Appended a new solution I found following the link to the possible duplicate]

    The answer to my question seems to be that it is not possible to get group names in any way except from the Regex object.

    I used part of the solution from the first comment reference to work around this, but I would have liked to be able to go the more direct route.

    Here is the solution I ended up with. (It uses LINQPad's Dump().)

    var source = "select * from people where id > 10";
    
    var re = new Regex(@"
        (?:
        (?<reserved>select|from|where|and|or|null|is|not)|
        (?<string>'[^']*')|
        (?<number>\d+)|
        (?<identifier>[a-z][a-z_0-9]+|\[[^\]]+\])|
        (?:\s+)|
        (?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*|,|\.)|
        (?<other>.*)
        )+
        ", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Compiled);
        
    // Run the match once, then look up each named group's captures.
    var match = re.Match(source);

    re.GetGroupNames()
        .Where(name => name != "0")   // "0" is the implicit whole-match group
        .SelectMany(name => match.Groups[name].Captures
            .Cast<Capture>()
            .Where(c => c.Length > 0)
            .Select(c => new { Type = name, c.Index, c.Length, c.Value }))
        .OrderBy(r => r.Index)
        .ToList()
        .Dump();   // Dump() is a LINQPad extension method
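
    A variation on the workaround above, written as a plain console program so it does not depend on LINQPad. It walks the group numbers and resolves each name with Regex.GroupNameFromNumber, so it should not need Group.Name. The class names GroupNameLexer and Token are made up for this sketch, and the pattern is the simplified one from the question, so treat it as an illustration rather than a finished lexer.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    // Hypothetical token type, introduced only for this sketch.
    public sealed class Token
    {
        public string Type { get; set; }
        public int Index { get; set; }
        public int Length { get; set; }
        public string Value { get; set; }
    }

    public static class GroupNameLexer
    {
        // Simplified token pattern copied from the question.
        private static readonly Regex TokenPattern = new Regex(@"
            (?:
            (?<string>'[^']*')|
            (?<number>\d+)|
            (?<identifier>[a-zA-Z][a-zA-Z_0-9]+)|
            (?:\s+)|
            (?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*)|
            (?<other>.*)
            )+
            ", RegexOptions.IgnorePatternWhitespace);

        public static List<Token> Lex(string source)
        {
            Match match = TokenPattern.Match(source);

            return TokenPattern.GetGroupNumbers()
                .Where(number => number != 0)             // group 0 is the whole match
                .SelectMany(number => match.Groups[number].Captures
                    .Cast<Capture>()
                    .Where(c => c.Length > 0)             // drop the empty 'other' capture
                    .Select(c => new Token
                    {
                        Type = TokenPattern.GroupNameFromNumber(number),
                        Index = c.Index,
                        Length = c.Length,
                        Value = c.Value
                    }))
                .OrderBy(t => t.Index)
                .ToList();
        }

        public static void Main()
        {
            foreach (Token t in Lex("Select * from items where id > '10'"))
                Console.WriteLine($"{t.Type} {t.Index} {t.Length} {t.Value}");
        }
    }
    ```

    The name still comes from the Regex object, as noted above; the lookup is just keyed by group number instead of by name.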
    

    Based on a comment on the possible duplicate: from .NET 4.7 onward, Group has a Name property, which was not present when I made this test. So in case anyone stumbles upon this and is not discouraged enough, here is a version that does what I originally tried but no longer need for anything :)

    var matches = Regex.Matches("Select * from items where id > '10'", @"
    (?:
    (?<string>'[^']*')|
    (?<number>\d+)|
    (?<identifier>[a-zA-Z][a-zA-Z_0-9]+)|
    (?:\s+)|
    (?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*)|
    (?<other>.*)
    )+
    ", RegexOptions.IgnorePatternWhitespace)
    .Cast<Match>()
    .SelectMany(m => m
       .Groups
       .Cast<Group>()
       .SelectMany(g => g
          .Captures
          .Cast<Capture>()
          .Select(c => new { c.Index, c.Length, c.Value, g.Name })))
    .Skip(1)
    .Where(m => m.Length > 0)
    .OrderBy(m => m.Index).Dump();
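
    For comparison, here is a sketch of a different layout that also relies on Group.Name (so .NET 4.7 or later, per the comment mentioned above). It is not the approach used above: the outer repeating group is dropped so that Regex.Matches returns one match per token, and the token type is the single named group that succeeded in that match. The class name PerTokenLexer, the whitespace group, and the change of other from .* to .+ (to avoid empty matches) are assumptions made for this example.

    ```csharp
    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static class PerTokenLexer
    {
        // One alternative per token type; whitespace gets its own named group
        // here so that every match has exactly one successful named group.
        private static readonly Regex TokenPattern = new Regex(@"
            (?<string>'[^']*')|
            (?<number>\d+)|
            (?<identifier>[a-zA-Z][a-zA-Z_0-9]+)|
            (?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*)|
            (?<whitespace>\s+)|
            (?<other>.+)
            ", RegexOptions.IgnorePatternWhitespace);

        // Returns one "type index length value" line per token.
        public static string[] Lex(string source)
        {
            return TokenPattern.Matches(source)
                .Cast<Match>()
                .Select(m => new
                {
                    Match = m,
                    // Groups[0] is the whole match, so skip it and take the
                    // named group that actually matched.
                    Group = m.Groups.Cast<Group>().Skip(1).First(g => g.Success)
                })
                .Where(t => t.Group.Name != "whitespace")
                .Select(t => $"{t.Group.Name} {t.Match.Index} {t.Match.Length} {t.Match.Value}")
                .ToArray();
        }

        public static void Main()
        {
            foreach (string line in Lex("Select * from items where id > '10'"))
                Console.WriteLine(line);
        }
    }
    ```

    Because each match is a single token, the matches already come back in source order and no Skip(1) or OrderBy over the captures is needed.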