I am trying to build a very simplified lexer using regex and named groups in C#.
I can get all the matched tokens along with their positions just fine, but I cannot find a way to also get the name of the group that matched.
I was planning to use that as the token type.
Here is a small example designed to lex simple SQL.
var matches = Regex.Matches("Select * from items where id > '10'", @"
    (?:
        (?<string>'[^']*')|
        (?<number>\d+)|
        (?<identifier>[a-zA-Z][a-zA-Z_0-9]*)|
        (?:\s+)|
        (?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*)|
        (?<other>.*)
    )+
    ", RegexOptions.IgnorePatternWhitespace)
    .Cast<Match>()
    // Flatten every capture of every group of every match.
    .SelectMany(m => m
        .Groups
        .Cast<Group>()
        .SelectMany(g => g
            .Captures
            .Cast<Capture>()
            .Select(c => new { c.Index, c.Length, c.Value })))
    .Skip(1)                     // drop the first capture: group 0, the whole match
    .Where(m => m.Length > 0)    // drop empty captures
    .OrderBy(m => m.Index);
This returns a small result like this (index, length, value):
0 6 Select
7 1 *
9 4 from
14 5 items
20 5 where
26 2 id
29 1 >
31 4 '10'
But how can I get the capture group names into the table? Is it possible?
This is not a homework exercise or any type of school work; it's an experiment I am doing for a simple automation API for one of our products.
I can probably rewrite it using a more verbose solution, but I kind of like the "one-liner approach" of this one ;)
And if all else fails, I already have a full lexer using real classes and much more advanced pattern matching, but that is not really required for this :D
UPDATE! I know which groups are available. What I would like to get is, for each capture in the result, which group caught it.
As the first comment points out, there is a method to get all group names from a regex, but then you have to fetch the results group by group; there does not seem to be a way to get from a capture back to its group.
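To make that concrete, here is a minimal sketch (not from the original post) of the lookup that does exist: the names live on the Regex, and each one is used to index into a match's Groups. The two-group pattern and the variable names are made up for illustration, and the snippet assumes C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical two-group pattern, only here to show the lookup API.
var re = new Regex(@"(?<number>\d+)|(?<word>[a-z]+)");

// Group names are a property of the Regex, not of a Capture.
// The name "0" always refers to the whole match.
string[] names = re.GetGroupNames();
Console.WriteLine(string.Join(", ", names));

// Going from name to group works; going from capture to name does not.
Match m = re.Match("42");
foreach (string name in names)
{
    Group g = m.Groups[name];
    Console.WriteLine($"{name}: success={g.Success}, captures={g.Captures.Count}");
}
```

Note that a Capture only carries Index, Length and Value, so any group name has to be attached while iterating from the group side.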
[Appended a new solution I found following the link to the possible duplicate]
The answer to my question seems to be that it is not possible to get group names in any way except from the regex object.
I used part of the solution from the first comment's reference to work around this, but I would have liked to be able to go the more direct route.
Here is the solution I ended up with. (It uses LINQPad's Dump() to show the result.)
var source = "select * from people where id > 10";
var re = new Regex(@"
    (?:
        (?<reserved>select|from|where|and|or|null|is|not)|
        (?<string>'[^']*')|
        (?<number>\d+)|
        (?<identifier>[a-z][a-z_0-9]*|\[[^\]]+\])|
        (?:\s+)|
        (?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*|,|\.)|
        (?<other>.*)
    )+
    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Compiled);

// Match once, then look up each named group's captures on that single match.
var match = re.Match(source);
(
    from name in re.GetGroupNames()
    where name != "0"    // "0" is the whole match, not a token type
    select new { name, captures = match.Groups[name].Captures }
)
.SelectMany(r =>
    from Capture c in r.captures
    where c.Length > 0   // drop empty captures
    select new { Type = r.name, c.Index, c.Length, c.Value })
.OrderBy(r => r.Index)
.ToList()
.Dump();
Based on a comment on the possible duplicate: from .NET Framework 4.7 onward, Group has a Name property, which was not present when I made this test. So in case anyone stumbles upon this and is not discouraged enough, here is a version that does what I originally tried but no longer need for anything :)
var matches = Regex.Matches("Select * from items where id > '10'", @"
    (?:
        (?<string>'[^']*')|
        (?<number>\d+)|
        (?<identifier>[a-zA-Z][a-zA-Z_0-9]*)|
        (?:\s+)|
        (?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*)|
        (?<other>.*)
    )+
    ", RegexOptions.IgnorePatternWhitespace)
    .Cast<Match>()
    .SelectMany(m => m
        .Groups
        .Cast<Group>()
        .SelectMany(g => g
            .Captures
            .Cast<Capture>()
            // g.Name carries the group name along with each capture.
            .Select(c => new { c.Index, c.Length, c.Value, g.Name })))
    .Skip(1)                     // drop the first capture: group 0, the whole match
    .Where(m => m.Length > 0)    // drop empty captures
    .OrderBy(m => m.Index).Dump();
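A possible variation, not something from the original experiment: if the outer (?: ... )+ wrapper is dropped, every Match is a single token, and the token type is simply the name of the one named group that succeeded. The sketch below assumes Group.Name is available (.NET Framework 4.7 or later, or .NET Core) and is written as C# 9 top-level statements. The `whitespace` group name, the `.+` in `other`, and the variable names are my own choices for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same alternatives as above, but one token per Match instead of one
// big match with many captures. Whitespace gets its own named group
// so that exactly one named group succeeds for every match.
var tokenRegex = new Regex(@"
      (?<string>'[^']*')
    | (?<number>\d+)
    | (?<identifier>[a-zA-Z][a-zA-Z_0-9]*)
    | (?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*)
    | (?<whitespace>\s+)
    | (?<other>.+)
    ", RegexOptions.IgnorePatternWhitespace);

var tokens = tokenRegex.Matches("Select * from items where id > '10'")
    .Cast<Match>()
    // Skip group 0 (the whole match) and take the named group that matched.
    .Select(m => m.Groups.Cast<Group>().Skip(1).First(g => g.Success))
    .Where(g => g.Name != "whitespace")
    .Select(g => (Type: g.Name, g.Index, g.Length, g.Value))
    .ToList();

foreach (var t in tokens)
    Console.WriteLine($"{t.Index,3} {t.Length,3} {t.Type,-10} {t.Value}");
```

Because Regex.Matches returns matches in input order, this version needs neither the OrderBy nor the Skip(1) over the flattened captures. In LINQPad, the final loop could be replaced with tokens.Dump().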