java regex tokenize lexer

While tokenizing the string 40 println "Hello ",(5+6-4), "-4" is returned as a single token instead of two separate tokens


I am writing a lexer in Java for a custom base language. For the line 40 println "Hello ",(5+6-4) I want the output to be:

40
println
"Hello "
,
(
5
+
6
-
4
)

Everything else comes out correctly, but for some reason I am getting - and 4 together as a single token, "-4".

Regex used:

Numbers: -?[0-9]+
Special operators / characters: [\\[|\\]|/|.|$|*|-|+|=|>|<|#|(|)|%|,|!|||&|||{|}]

When I remove the leading "-" from the number regex, I get an error at index 89, which is the start of ?[0-9]+:

    Exception in thread "main" java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 89
    ((?<Reserved>\bPRINTLN\b|\bPRINT\b|\bINTEGER\b|\bINPUT\b|\bEND\b|\bLET\b))|((?<Constants>?[0-9]+))|((?<Special>[\[|\]|/|.|$|*|-|+|=|>|<|#|(|)|%|,|!|||&|||{|}]))|((?<Literals>"[^"]*"))|((?<Identifiers>\w+))

I am storing the regex in a string and using named capturing grouping to identify the tokens.


Solution

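  • (?<Constants>?[0-9]+) - This part of your regex is the problem. When you removed the -, the ? that used to quantify it was left behind directly after the group name with nothing to apply to, which is what "dangling meta character" means.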

    Also, there is no need to separate the members of a character class with |. Inside [...], | is just a literal character, not an alternation.

    Based on the error you shared, the following would be what you want:

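        // Requires java.util.regex.Matcher and java.util.regex.Pattern.
        String regex = "((?<Reserved>\\bPRINTLN\\b|\\bPRINT\\b|\\bINTEGER\\b|\\bINPUT\\b|\\bEND\\b|\\bLET\\b))|((?<Constants>[0-9]+))|((?<Special>[\\[\\]/.$*\\-+=><#()%,!|&{}]))|((?<Literals>\"[^\"]*\"))|((?<Identifiers>\\w+))";
        // Note: "println" is lowercase here, so it matches Identifiers rather than
        // Reserved unless the pattern is compiled with Pattern.CASE_INSENSITIVE.
        String s = "40 println \"Hello \",(5+6-4) ";
        Matcher matcher = Pattern.compile(regex).matcher(s);
        // Print each token on its own line.
        while (matcher.find()) {
            System.out.println(matcher.group());
        }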
    

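    I have removed the dangling ? mentioned above, removed the | separators inside the character class (a single | stays as a literal member), and escaped the - inside the character class, because an unescaped - between two characters is read as a range. Alternatively, you can move the - to the end of the character class, where it is taken literally.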
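
Since the question mentions using named capturing groups to identify the tokens, here is a minimal sketch of how the token type could be read back after each match. The class name TokenDemo, the tokenize helper, and the "Type: text" output format are hypothetical and only for illustration. The sketch also assumes Pattern.CASE_INSENSITIVE so that lowercase println counts as a reserved word, drops the redundant outer parentheses, and places the - at the end of the character class instead of escaping it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenDemo {
    // Group names from the question's pattern. Order matters: alternation
    // tries the branches left to right, so Reserved is checked before Identifiers.
    private static final String[] GROUPS = {
        "Reserved", "Constants", "Special", "Literals", "Identifiers"
    };

    // Assumption: case-insensitive matching, so "println" hits the Reserved branch.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<Reserved>\\b(?:PRINTLN|PRINT|INTEGER|INPUT|END|LET)\\b)"
        + "|(?<Constants>[0-9]+)"                    // no leading -? here
        + "|(?<Special>[\\[\\]/.$*+=><#()%,!|&{}-])" // '-' last, so it is literal
        + "|(?<Literals>\"[^\"]*\")"
        + "|(?<Identifiers>\\w+)",
        Pattern.CASE_INSENSITIVE);

    // Returns one "Type: text" entry per token (hypothetical output format).
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Exactly one named group is non-null for each successful match.
            for (String name : GROUPS) {
                if (m.group(name) != null) {
                    tokens.add(name + ": " + m.group(name));
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("40 println \"Hello \",(5+6-4)")) {
            System.out.println(token);
        }
    }
}
```

With the sign left out of the number pattern, - is always emitted as a Special token. A negative literal such as -4 then arrives as two tokens, and deciding whether the minus is unary or binary would be left to the parser.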