regex, token, lex

How to match both a substring and the string itself?


I'm trying to tokenize a bunch of code with lex, matching different kinds of keywords with different regexes. When the following regex matches, whatever it matched is tokenized as "VARIABLE":

[_a-zA-Z][_a-zA-Z0-9]*

And the following matches the print statement:

\s*print\((.*?)\)\s*

What I need is that when the following statements go through lexical analysis:

myVar_12
print(myVar_12)

the tokens should be:

VARIABLE
PRINT VARIABLE

But what I get is:

VARIABLE
PRINT

I only started learning regex yesterday and could not figure out what I should do, so please pardon my meaningless regexes.


Solution

  • You've clarified in a comment that you want print to be a keyword regardless of whether it's followed by a parameter list or not. Therefore the parameter list should not be part of print's regex¹. The regex to match print should simply be print.

    print                   return PRINT;
    [_a-zA-Z][_a-zA-Z0-9]*  return VARIABLE;
    

    Note that the order matters: the input "print" is matched by both regular expressions, and when several regular expressions produce a match of the same length, lex uses the one that comes first in the file. So keywords should always be defined before the rule for identifiers. (A longer match always wins, so an input such as printer is still tokenized as a single VARIABLE.)

    You'll presumably also want to recognize parentheses as their own tokens and to ignore whitespace.


    ¹ In fact it shouldn't be part of the regex either way. But if you wanted print to be a contextual keyword, you'd need a different solution.
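
To see how the rule order, the parenthesis tokens, and the whitespace rule interact, here is a minimal Python sketch that imitates the way lex picks a rule: the longest match wins, and ties go to the rule listed first. It is only a simulation for illustration, not lex itself. The token names LPAREN and RPAREN and the whitespace pattern are assumptions, since the answer does not name them.

```python
import re

# Rules in file order, mirroring the lex rules in the answer.
# LPAREN, RPAREN and the whitespace rule are assumed additions for illustration.
RULES = [
    ("PRINT", re.compile(r"print")),
    ("VARIABLE", re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")),
    ("LPAREN", re.compile(r"\(")),
    ("RPAREN", re.compile(r"\)")),
    (None, re.compile(r"[ \t\n]+")),  # whitespace: consumed but not emitted
]


def tokenize(text):
    """Return token names, choosing the longest match; ties go to the earlier rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly greater: a later rule only wins with a longer match.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_len == 0:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if best_name is not None:
            tokens.append(best_name)
        pos += best_len
    return tokens


if __name__ == "__main__":
    for line in ["myVar_12", "print(myVar_12)", "printer"]:
        print(line, "->", " ".join(tokenize(line)))
```

With the parentheses as separate tokens, print(myVar_12) yields PRINT LPAREN VARIABLE RPAREN rather than the PRINT VARIABLE pair from the question. A parser would then be responsible for recognizing the call syntax. Swapping the first two entries of RULES makes "print" come out as VARIABLE, which is the ordering pitfall described above.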