
Processing complex lexicals in Rascal


What's the best practice for dealing with complex literals in Rascal?

Two examples from JavaScript (my DSL has similar cases):

  • Strings with \ escapes - have to be unescaped into actual value.
  • Regular expression literals - need their own sub-AST.

implode refuses to map lexicals to abstract trees; they are evidently handled differently from syntax productions, even though complete parse trees are available. For example, the following parser fails with IllegalArgument("Missing lexical constructor"):

module lexicals

import Prelude;

lexical Char = "\\" ![] | ![\\]; // potentially escaped character
lexical String = "\"" Char* "\""; // if I make this "syntax", implode works as expected

start syntax Expr = string: String;

data EXPR = string(list[str] chars);

void main(list[str] args) {
    str text = "\"Hello\\nworld\"";
    print(implode(#EXPR, parse(#Expr, text)));
}

The only idea I have so far is to capture all lexicals as raw strings and later re-parse them (implode and all) using separately defined syntaxes without layout whitespace. Hopefully, there's a better way.


Solution

  • The way implode converts a parse tree into an AST is documented in the Rascal Tutor under implode. It contains the following rule:

    Unlabeled lexicals are imploded to str, int, real, bool depending on the expected type in the ADT. To implode lexical into types other than str, the PDB parse functions for integers and doubles are used. Boolean lexicals should match "true" or "false". NB: lexicals are imploded this way, even if they are ambiguous.

    So, solution 1 is to add a label to your production:

    lexical String = string: "\"" Char* "\"";
    

    Also, perhaps you do not need an AST next to your parse tree, or at least not one that has to closely match your grammar. The two common scenarios are:

    1. You need an AST, since the structure of the grammar is unsuited for your purpose. In that case, you have to manually write your implode function.
    2. The structure of your parse tree is good enough. In that case, check out the example for Concrete Syntax. It is a very clean way to work with the target language nested inside Rascal.

    We are leaning more and more toward deprecating the implode function, since our concrete syntax is powerful enough for most cases.
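
    For scenario 1, a hand-written conversion might look like the sketch below. It is not taken from the Rascal documentation: the module name Unescape, the functions unescape and toAST, the field name val, and the choice of supported escapes (\n and \t) are assumptions made for illustration. It reuses the grammar from the question, skips implode, and decodes the literal from the text of the parse tree.

    ```rascal
    module Unescape

    import ParseTree;
    import String;

    // Grammar copied from the question.
    lexical Char = "\\" ![] | ![\\];
    lexical String = "\"" Char* "\"";

    start syntax Expr = string: String;

    // Hypothetical AST: it stores the decoded value instead of mirroring the grammar.
    data EXPR = string(str val);

    // Hypothetical helper: strips the surrounding quotes and decodes backslash escapes.
    // Only \n and \t are translated (an assumption); any other escaped character
    // stands for itself, so \\ becomes a single backslash and \" becomes a quote.
    str unescape(str raw) {
        map[str, str] escapes = ("n": "\n", "t": "\t");
        str body = substring(raw, 1, size(raw) - 1);
        str result = "";
        int i = 0;
        while (i < size(body)) {
            str c = body[i];
            if (c == "\\" && i + 1 < size(body)) {
                i += 1;
                str e = body[i];
                // use the mapped character if there is one, else the character itself
                result += escapes[e] ? e;
            } else {
                result += c;
            }
            i += 1;
        }
        return result;
    }

    // Manual replacement for implode. Expr has a single alternative here, so the
    // text of the whole tree is the string literal; more alternatives would need
    // one case each.
    EXPR toAST(Expr e) = string(unescape("<e>"));
    ```

    With this in place, toAST(parse(#Expr, "\"Hello\\nworld\"")) should yield a string node whose val contains a real line break rather than the two characters \ and n.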