I'm trying to tokenize a bunch of code with lex, matching different kinds of keywords with different regexes. When the following regex matches, whatever it matched is tokenized as "VARIABLE":
[_a-zA-Z][_a-zA-Z0-9]*
And the following matches the print statement:
\s*print\((.*?)\)\s*
What I need is that when the following statements go through lexical analysis:
myVar_12
print(myVar_12)
the tokens should be:
VARIABLE
PRINT VARIABLE
But what I get is:
VARIABLE
PRINT
I started learning regex only yesterday and could not figure out what I should do, so please pardon my meaningless regexes.
You've clarified in a comment that you want print to be a keyword regardless of whether it's followed by a parameter list or not. Therefore the parameter list should not be part of print's regex¹. The regex to match print should simply be print.
print return PRINT;
[_a-zA-Z][_a-zA-Z0-9]* return VARIABLE;
Note that the order matters because the input "print" could be matched by both regular expressions, and if multiple regular expressions produce a match of the same length, lex uses the one that comes first in the file. So keywords should always be defined before the rule for identifiers.
You'll also want to recognize parentheses as their own tokens and to ignore white space (presumably).
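A minimal sketch of what the complete scanner might look like with those additions, assuming flex. The token names LPAREN and RPAREN, the enum values, the catch-all error rule, and the small main driver are my own assumptions for illustration; in a real project the token codes would normally come from the parser's generated header.

```lex
%{
#include <stdio.h>
/* Hypothetical token codes, defined here only so the example stands alone. */
enum { PRINT = 258, VARIABLE, LPAREN, RPAREN };
%}

%option noyywrap

%%
"print"                 { return PRINT; }     /* keyword rule comes first */
[_a-zA-Z][_a-zA-Z0-9]*  { return VARIABLE; }  /* any other identifier */
"("                     { return LPAREN; }
")"                     { return RPAREN; }
[ \t\r\n]+              { /* skip white space */ }
.                       { fprintf(stderr, "unexpected character: %s\n", yytext); }
%%

/* Print the name of each token until end of input. */
int main(void) {
    static const char *names[] = { "PRINT", "VARIABLE", "LPAREN", "RPAREN" };
    int tok;
    while ((tok = yylex()) != 0)
        printf("%s\n", names[tok - PRINT]);
    return 0;
}
```

Built with something like `flex scanner.l && cc lex.yy.c -o scanner`, this should turn print(myVar_12) into PRINT, LPAREN, VARIABLE, RPAREN. An input such as printer still comes out as a single VARIABLE, because lex prefers the longest match and only falls back on rule order when the matches are the same length.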
¹ In fact it shouldn't be part of the regex either way. But if you wanted print to be a contextual keyword, you'd need a different solution.