Dealing with escape sequences in ANTLR string literals


There are a few examples of parsing strings in public ANTLR4 grammars, e.g. the JavaScript grammar. That grammar supports escape sequences in strings.

Given the following input:

"\""

it will produce a parse tree with a StringLiteral token. The text inside the token will be \". But the end user of the language (who writes the program that ANTLR later parses) actually meant to get " as the result.

My question is how to convert one into the other. Should this be programmed into the ANTLR grammar somehow? If so, how?

Or is this a task for the parse tree interpreter? If so, how would I do it in JavaScript? I see no good options except eval() or a regex, both of which frighten me. Maybe there are built-in APIs in JavaScript I don't know about?


Solution

  • Different approaches are possible. I recommend not touching the text in any way until you really need it. In my code base I have a function that takes any rule context and returns its text. If that rule context is a special text literal context, it also does the unquoting and escape conversion:

    /**
     * Returns the text which the given context matched.
     *
     * @param context The parser context for which to return the text. If that is a text literal, some special
     *                processing takes place to replace escape sequences, double quotes etc.
     *
     * @param convertEscapes Indicates if escape sequences should be handled for text literals.
     *
     * @returns The text for the context.
     */
    export const getText = (context: RuleContext, convertEscapes: boolean): string => {
        if (context instanceof TextLiteralContext) {
            let result = "";
    
            for (let index = 0; index < context.getChildCount(); ++index) {
                const child = context.textStringLiteral(index);
                // eslint-disable-next-line no-underscore-dangle
                const token = child._value;
                if (token.type === MySQLParser.DOUBLE_QUOTED_TEXT || token.type === MySQLParser.SINGLE_QUOTED_TEXT) {
                    let text = token.text || "''";
                    const quoteChar = text[0];
                    const doubledQuoteChar = quoteChar.repeat(2);
                    text = text.substring(1, text.length - 1); // Remove outer quotes.
                    text = text.split(doubledQuoteChar).join(quoteChar); // Collapse every doubled quote char into a single one.
    
                    result += text;
    
                    break;
                }
            }
    
            if (convertEscapes) {
                const temp = result;
                result = "";
    
                let pendingEscape = false;
                for (let c of temp) {
                    if (pendingEscape) {
                        pendingEscape = false;
                        switch (c) {
                            case "n": {
                                c = "\n";
                                break;
                            }
                            case "t": {
                                c = "\t";
                                break;
                            }
                            case "r": {
                                c = "\r";
                                break;
                            }
                            case "b": {
                                c = "\b";
                                break;
                            }
                            case "0": {
                                c = "\0";
                                break; // ASCII null
                            }
                            case "Z": {
                                c = "\u001A"; // ASCII 26 (Ctrl+Z)
                                break; // Win32 end of file
                            }
    
                            default: {
                                break;
                            }
                        }
                    } else if (c === "\\") {
                        pendingEscape = true;
                        continue;
                    }
                    result += c;
                }
    
                if (pendingEscape) {
                    result += "\\";
                }
            }
    
            return result;
        }
    
        return context.getText(); // In all other cases return the text unprocessed.
    };
    
    

    That TextLiteralContext is from my grammar (MySQL); you must replace it with the appropriate context from yours, the one that represents text literals in your language. If you don't have such a parser rule, I recommend creating one to centralize text literal handling.
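For the JavaScript-style literal from the question, the same idea can be reduced to a small standalone function that works on the token text alone. This is only a minimal sketch: `unescapeStringLiteral` and `simpleEscapes` are hypothetical names, not part of ANTLR or any grammar, and the sketch assumes the token text still includes its surrounding quotes. It handles the common single-character escapes and `\uXXXX` only, and ignores octal escapes, `\xNN`, `\u{...}` and line continuations.

```typescript
// Hypothetical lookup table for single-character escapes (an assumption, not from any grammar).
const simpleEscapes = new Map<string, string>([
    ["n", "\n"], ["t", "\t"], ["r", "\r"], ["b", "\b"],
    ["f", "\f"], ["v", "\v"], ["0", "\0"],
]);

// Hypothetical helper: takes the raw token text (quotes included) and returns the decoded value.
const unescapeStringLiteral = (tokenText: string): string => {
    const body = tokenText.slice(1, -1); // Remove the outer quotes.
    let result = "";

    for (let i = 0; i < body.length; ++i) {
        const c = body[i];
        if (c !== "\\" || i === body.length - 1) {
            // Ordinary character, or a lone trailing backslash which is kept as is.
            result += c;
            continue;
        }

        const next = body[++i]; // The character following the backslash.
        const hex = body.slice(i + 1, i + 5);
        if (next === "u" && /^[0-9a-fA-F]{4}$/.test(hex)) {
            // \uXXXX: convert the four hex digits into one UTF-16 code unit.
            result += String.fromCharCode(parseInt(hex, 16));
            i += 4;
        } else {
            // Known escape, or any other character standing for itself (\" \' \\).
            result += simpleEscapes.get(next) ?? next;
        }
    }

    return result;
};
```

In a listener or visitor you would call it with the text of the string token, for example `unescapeStringLiteral(token.text)`. How you reach that token depends on your grammar.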