(f)lex the difference between PRINTA$ and PRINT A$

I am parsing BASIC:

530 FOR I=1 TO 9:C(I,1)=0:C(I,2)=0:NEXT I

The patterns that are used in this case are:

FOR  { return TOK_FOR; }
TO   { return TOK_TO; }
NEXT { return TOK_NEXT; }
(many lines later...)
[A-Za-z_@][A-Za-z0-9_]*[\$%\!#]? {
          yylval.s = g_string_new(yytext);
          return IDENTIFIER;
        }
(many lines later...)
[ \t\r]   { /* eat non-string whitespace */ }

The problem occurs when the spaces are removed, which was common in the era of 8k RAM. So the line that is actually found in Super Star Trek is:

530 FORI=1TO9:C(I,1)=0:C(I,2)=0:NEXTI

Now I know why this is happening: "FORI" is longer than "FOR", it's a valid IDENTIFIER in my pattern, so it matches IDENTIFIER.

In the original MS BASIC, variable names could be at most two characters long, so the pattern had no * and "FORI" could not match as an identifier. But this scanner also supports GW BASIC and Atari BASIC, which allow long variable names. That makes "FORI" a legal variable name in my scanner, and it wins because it is the longest match.

I have looked at the manual, and the only similar example deliberately returns an error. It seems that what I need is "match the ID, but only if it is not the same as a defined %token". Is there such a thing?

Solution

It's easy to recognise keywords even when an identifier is concatenated directly onto them. What's tricky is deciding in which contexts you should apply that technique.

Here's a simple pattern for recognising keywords, using trailing context:

tail       [[:alnum:]]*[$%!#]?
%%
FOR/{tail}    { return TOK_FOR; }
TO/{tail}     { return TOK_TO; }
NEXT/{tail}   { return TOK_NEXT; }
  /* etc. */
[[:alpha:]]{tail}  { /* Handle an ID */ }

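Effectively, that just extends the keyword match without extending the matched token. Flex counts the trailing context when it compares match lengths, so on "FORI" the FOR rule and the ID rule both match four characters; the tie goes to the rule listed first, which is why the keyword rules must come before the ID rule. Only "FOR" is consumed, and the scan resumes at "I".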

But I doubt the problem is so simple. How should FORFORTO be tokenised, for example?
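
As a rough way to explore that question, here is a hand-written C sketch (not flex output) that mimics how the patterns above resolve: at the start of each token a keyword prefix wins, otherwise the whole identifier is taken. The three-entry keyword table, the function names keyword_at, identifier_at and tokenise, and the handling of digits and punctuation are assumptions made for illustration; none of them come from the original scanner.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical keyword table; a real BASIC has many more entries.
   Order matters in the same way rule order matters in flex. */
static const char *const keywords[] = { "FOR", "TO", "NEXT" };

/* Length of the first keyword that starts at s, or 0 if none does. */
static size_t keyword_at(const char *s)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        size_t n = strlen(keywords[i]);
        if (strncmp(s, keywords[i], n) == 0)
            return n;
    }
    return 0;
}

/* Length of an identifier at s, following [[:alpha:]][[:alnum:]]*[$%!#]? */
static size_t identifier_at(const char *s)
{
    size_t n;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    n = 1;
    while (isalnum((unsigned char)s[n]))
        n++;
    if (s[n] != '\0' && strchr("$%!#", s[n]) != NULL)
        n++;
    return n;
}

/* Split src into tokens and write them to out separated by single spaces.
   Assumption: digit runs are numbers, anything else is a one-character token. */
static char *tokenise(const char *src, char *out, size_t outsize)
{
    size_t used = 0;
    if (outsize == 0)
        return out;
    out[0] = '\0';
    while (*src != '\0') {
        size_t n;
        int w;
        if (*src == ' ' || *src == '\t' || *src == '\r') {
            src++;                      /* skip whitespace between tokens */
            continue;
        }
        n = keyword_at(src);            /* keyword prefix beats identifier */
        if (n == 0)
            n = identifier_at(src);
        if (n == 0)
            while (isdigit((unsigned char)src[n]))
                n++;
        if (n == 0)
            n = 1;                      /* operator or punctuation */
        w = snprintf(out + used, outsize - used, "%s%.*s",
                     used ? " " : "", (int)n, src);
        if (w < 0 || (size_t)w >= outsize - used) {
            out[used] = '\0';           /* buffer full: drop the partial token */
            break;
        }
        used += (size_t)w;
        src += n;
    }
    return out;
}
```

Under these assumptions, tokenise on "FORI=1TO9" gives FOR I = 1 TO 9 and on "FORFORTO" gives FOR FOR TO, while "ATOB" stays a single identifier because no keyword starts at its first character. Whether those are the answers you want depends on the dialect being parsed, which is the tricky part.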