I am writing a lexer in Java for a custom base language. For the line 40 println "Hello ",(5+6-4) I want the following output, one token per line:
40
println
"Hello "
,
(
5
+
6
-
4
)
Everything else comes out right, but for some reason I am getting - and 4 together as a single token, "-4".
Regex used:
Numbers: -?[0-9]+
Special operators / characters: [\\[|\\]|/|.|$|*|-|+|=|>|<|#|(|)|%|,|!|||&|||{|}]
When I remove the leading "-" from the number regex, I get an error at index 89, which is the start of ?[0-9]+:
Exception in thread "main" java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 89
((?<Reserved>\bPRINTLN\b|\bPRINT\b|\bINTEGER\b|\bINPUT\b|\bEND\b|\bLET\b))|((?<Constants>?[0-9]+))|((?<Special>[\[|\]|/|.|$|*|-|+|=|>|<|#|(|)|%|,|!|||&|||{|}]))|((?<Literals>"[^"]*"))|((?<Identifiers>\w+))
I am storing the regex in a string and using named capturing groups to identify the tokens.
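The (?<Constants>?[0-9]+) part of your regex is the problem: the ? right after the group name has nothing to quantify, so it is a dangling metacharacter. You removed the - but left its ? behind.
Also, there is no need to separate the members of a character class with |. Inside [...] a | is just a literal character.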
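To illustrate the point about character classes, here is a small sketch (the class name CharClassDemo and its matches helper are made up for this example). It also shows a side effect of the original Special class: as far as I can tell, the sequence |-| is read as a range from | to |, so - itself is not a member of that class at all. With -?[0-9]+ listed before Special in the alternation, the minus sign was picked up by the number rule instead.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // | inside [...] is an ordinary character, not alternation
        System.out.println(matches("[a|b]", "|"));
        // |-| forms a range from | to |, so a lone - is not in this class
        System.out.println(matches("[*|-|+]", "-"));
        // an escaped - is a literal member
        System.out.println(matches("[*|\\-+]", "-"));
        // a - placed last in the class is also a literal member
        System.out.println(matches("[*|+-]", "-"));
    }
}
```

Only the second check should print false; the other three should print true.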
Based on the error you shared, the following would be what you want:
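import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Constants no longer carries the dangling ?, and the Special class lists its members without | separators
String regex = "((?<Reserved>\\bPRINTLN\\b|\\bPRINT\\b|\\bINTEGER\\b|\\bINPUT\\b|\\bEND\\b|\\bLET\\b))|((?<Constants>[0-9]+))|((?<Special>[\\[\\]/.$*\\-+=><#()%,!|&{}]))|((?<Literals>\"[^\"]*\"))|((?<Identifiers>\\w+))";
String s = "40 println \"Hello \",(5+6-4) ";
Matcher matcher = Pattern.compile(regex).matcher(s);
// Print each token on its own line
while (matcher.find()) {
    System.out.println(matcher.group());
}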
I have removed the dangling ? mentioned above, removed the |s used for separation inside the character class, and escaped the - inside the character class (alternatively, you can move the - to the end of the character class).
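Since you mention using the named groups to identify the tokens, here is a rough sketch of how that could look. The class name TokenTypeSketch, the tokenize method and the "Type: text" output format are my own choices for illustration, not something from your code. The group names and the regex are the ones from above, with the redundant outer parentheses dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenTypeSketch {
    // Group names in the same order as the alternatives in the regex.
    private static final String[] GROUPS =
        {"Reserved", "Constants", "Special", "Literals", "Identifiers"};

    // Alternatives are tried left to right, so the order below matters.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<Reserved>\\b(?:PRINTLN|PRINT|INTEGER|INPUT|END|LET)\\b)"
        + "|(?<Constants>[0-9]+)"
        + "|(?<Special>[\\[\\]/.$*\\-+=><#()%,!|&{}])"
        + "|(?<Literals>\"[^\"]*\")"
        + "|(?<Identifiers>\\w+)");

    // Hypothetical helper: returns one "Type: text" entry per token.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            // Exactly one named group is non-null for each match.
            for (String g : GROUPS) {
                if (m.group(g) != null) {
                    tokens.add(g + ": " + m.group(g));
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("40 println \"Hello \",(5+6-4)").forEach(System.out::println);
    }
}
```

Note that with this regex a lowercase println is reported as Identifiers, because the reserved words are written in uppercase; compiling with Pattern.CASE_INSENSITIVE would be one way to change that. The - now always comes out as its own Special token, so telling a negative number from a subtraction is left to whatever consumes the tokens.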