Tags: parsing, compiler-construction, lexer

How to tell a language parser scanner that a string is a literal and not an identifier by looking at the previous token


I'm writing a query language parser and I'm currently facing an issue that I have not seen in programming language compilers.

I have the following query:

status=acknowledged

And I'm expecting status to be a variable identifier and acknowledged to be a string literal. The only way to determine that is by looking at the = operator in between.

In theory the scanner should be able to return a new token without looking at the previous ones, but if that's the case, how would you distinguish an identifier from a string literal? Bear in mind that I cannot just add double-quotes around the acknowledged string as I need to work with a query language that already exists and I'm not allowed to make any changes to it.

Should I just keep track of the previous N tokens inside the scanner and act accordingly?

Things get even messier if I have something like this:

status=acknowledged&visibility=all

In this case both = and & are operators so I can't just say that if the token before the one I'm currently parsing is an operator I should consider this as a string literal.


Solution

  • The question you need to ask yourself is "Does my parser need to know whether acknowledged is a variable identifier or a string literal?"

    And I'm going to venture to suggest that the answer is, "No, it doesn't". You can parse an expression like status=acknowledged&visibility=all without knowing anything about status and acknowledged (or visibility and all) other than that they are operands. A possible lexical category for such operands might be "bare words" (the term comes from Perl) or "atoms" (Lisp).

    Of course, at some point you will want to figure out what these tokens mean (which is, by definition, a semantic question) and at that point some of them will be resolved to "variable name" and others to (unquoted) "string literal". If, for example, your = operator insists that its left-hand operand be a variable name and that its right-hand operand be a literal, you could easily do the appropriate transformations during a top-down traversal of the parse tree. I'm pretty sure that is the approach taken by most similar parsers.

    By the principle of separation of concerns, each component in a language processor should restrict itself, as much as possible, to a single piece of the puzzle. Try to avoid the temptation to prematurely do analysis which could more comfortably be delayed to an appropriate future phase. You'll find that all of the logic is simpler if you work according to this pattern.
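
To make the idea concrete, here is a minimal Python sketch of the three phases, not a definitive implementation. Everything in it is an assumption made for illustration: the names (tokenize, parse, resolve, Word, Var, Str, BinOp), the rule that a bare word is any run of characters other than =, & and whitespace, and the grammar in which & binds more loosely than =. Your existing query language may differ on all of these points.

```python
import re
from dataclasses import dataclass

# Assumed lexical rule: '=' and '&' are operators, anything else is a bare word.
TOKEN_RE = re.compile(r"\s*(?:(?P<op>[=&])|(?P<word>[^=&\s]+))")

@dataclass
class Word:   # purely lexical: not yet an identifier or a literal
    text: str

@dataclass
class Var:    # a bare word resolved to a variable name
    name: str

@dataclass
class Str:    # a bare word resolved to an unquoted string literal
    value: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def tokenize(src):
    """Scanner: emits ('op', text) or ('word', text) without looking back."""
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def parse(tokens):
    """Parser: query := comparison ('&' comparison)* ; comparison := word '=' word."""
    pos = 0

    def expect(kind, text=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        found_kind, found_text = tokens[pos]
        if found_kind != kind or (text is not None and found_text != text):
            raise SyntaxError(f"unexpected token {found_text!r}")
        pos += 1
        return found_text

    def comparison():
        left = Word(expect("word"))
        expect("op", "=")
        right = Word(expect("word"))
        return BinOp("=", left, right)

    node = comparison()
    while pos < len(tokens):
        expect("op", "&")
        node = BinOp("&", node, comparison())
    return node

def resolve(node):
    """Semantic pass: a top-down walk that decides what each bare word means."""
    if node.op == "=":
        # Assumed rule: '=' takes a variable on the left and a literal on the right.
        return BinOp("=", Var(node.left.text), Str(node.right.text))
    return BinOp(node.op, resolve(node.left), resolve(node.right))

tree = resolve(parse(tokenize("status=acknowledged&visibility=all")))
```

Note that the scanner never inspects previous tokens, and the parser only checks that operands and operators alternate correctly. The decision that status is a variable and acknowledged is a string lives entirely in resolve, so changing that rule later (say, to allow a variable on the right-hand side) would not touch the other two phases.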