I am trying to build a parser for a JS-like language with some unfortunate quirks. This language is provided to us by a vendor, so I do not have control over the syntax.
A simplified version of a statement takes three basic forms:
1. Assign (or compare!) one property to another:
obj.prop1 = obj.prop2
2. Quoted text, in either single quotes or double quotes:
obj.prop1 = "quoted text"
obj.prop1 = 'quoted text'
3. Unquoted text, which could be one or more words, whitespace, and punctuation:
obj.prop1 = text
obj.prop1 = multiple words and punctuation $1,000?#-, etc
This case does not allow dots, as far as I can tell.
For case 1, I can determine assignment or comparison externally based on context.
Cases 1 and 2 are relatively easy to deal with since they are clearly distinguished. Unfortunately, case 3 is easily confused with case 1 and I haven't been able to separate them out.
To make matters worse, this pattern repeats in a sub-expression. For example, filtering an array looks like this: obj.arrayProperty[property = value].
I've looked at semantic predicates and lexer modes, but haven't found a working solution via either path yet. That isn't to say there isn't one - just that I haven't found one.
Condensed versions of my lexer and parser are attached.
Is there a way for me to distinguish unquoted text from a property access?
I've tried various combinations of lexer modes and semantic predicates.
I also tried doing two passes through the parser. The first assumes quoted or unquoted text after the operator, and the second tries to treat the resulting text as a property access.
Lexer:
lexer grammar CondensedLexer;
AssignmentOrComparisonOperator: ('<' | '<=' | '=' | '!=' | '>' | '>=') -> pushMode(RHS);
AdditiveOperator: '+' | '-';
Identifier
: IdentifierStartCharacter IdentifierPartCharacter*
;
IntegerLiteral: Sign? DecimalDigitCharacter DecimalDigitCharacter*;
Dot: '.';
fragment DecimalDigit: '0'..'9';
fragment Sign: '+' | '-' ;
fragment LetterCharacter
// Category Letter, all subcategories; category Number, subcategory letter.
: [\p{L}\p{Nl}]
;
fragment DecimalDigitCharacter
// Category Number, subcategory decimal digit.
: [\p{Nd}]
;
fragment IdentifierStartCharacter
: LetterCharacter
| UnderscoreCharacter
;
fragment IdentifierPartCharacter
: LetterCharacter
| DecimalDigitCharacter
;
fragment UnderscoreCharacter
: '_' // underscore
| '\\u005' [fF] // Unicode escape sequence for underscore
;
/*
* these rules are important because they allow for insignificant white space to the left of the operator,
* while retaining white space to the right of the operator. I'm sure there are better ways to do this,
* and I'm happy to see them, but that is not my focus at the moment
*/
IWS: WS -> skip; // Insignificant white space
WS: [ \t\r\n]; // significant white space
NEWLINE: '\r'? '\n';
// RHS = right hand side - this mode allows for unquoted strings,
// single quoted strings, quoted strings, and, member accesses
mode RHS;
fragment DoubleQuote: '"';
fragment SingleQuote: '\'';
StringLiteral
: QuotedString
| UnquotedString;
QuotedString
: DoubleQuotedString
| SingleQuotedString
;
fragment DoubleQuotedString:
DoubleQuote RegularStringLiteralCharacter* DoubleQuote;
fragment SingleQuotedString:
SingleQuote RegularStringLiteralCharacter* SingleQuote;
fragment RegularStringLiteralCharacter
: SingleRegularStringLiteralCharacter
| SimpleEscapeSequence
;
fragment SingleRegularStringLiteralCharacter
// anything but ", \, and NewLineCharacter
: ~["\\\u000D\u000A\u0085\u2028\u2029]
;
fragment SimpleEscapeSequence
: '\\\'' | '\\"' | '\\\\' | '\\0' | '\\a' | '\\b' |
'\\f' | '\\n' | '\\r' | '\\t' | '\\v'
;
//Word: [A-Za-z_] [a-zA-Z0-9!@#$%^&*()[\]\\/ \t,|{}<>?`~]+;
UnquotedString: ~('.')+? EOF;
Parser:
parser grammar CondensedParser;
options {
tokenVocab=CondensedLexer;
}
statement
: assignmentOrComparison EOF;
assignmentOrComparison
: memberAccess IWS* AssignmentOrComparisonOperator IWS* restOfLine;
memberAccess
: memberAccess Dot Identifier
| Identifier
| IntegerLiteral // e.g. member.0
;
restOfLine
: memberAccess
| StringLiteral
;
A few notes:
IWS: WS -> skip; // Insignificant white space
WS: [ \t\r\n]; // significant white space
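Both rules match the same single characters and IWS is listed first, so \r and \n will be caught by the IWS lexer rule (a lone \n never reaches NEWLINE). We also just want to hide the WS from the parser, but not exclude it from the token stream, so we'll use channel(HIDDEN) instead of skip, so: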
WS: [ \t] -> channel(HIDDEN); // Insignificant white space
NEWLINE: '\r'? '\n';
next:
StringLiteral: QuotedString | UnquotedString;
Lexer rules don't work like parser rules, so this just treats QuotedString and UnquotedString as fragments; you'll only see StringLiteral tokens. (So we'll delete it, and the lexer rule UnquotedString will be replaced by the parser rule unquotedString.)
next:
UnquotedString: ~('.')+? EOF;
You really don't want EOF in a lexer rule. I suspect you meant NEWLINE, but we'll handle that differently.
We also need a token that recognizes all the characters that weren't previously tokenized but could be part of an unquotedString:
RandomToken: .;
Note: careful, we only want to match one character and it needs to be at the end of the Lexer rules, to keep it from interfering with other possible rule matches.
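To see why the position of the catch-all matters, here is a small plain-Java simulation of the way ANTLR chooses a lexer rule: the longest match wins, and on a tie the rule listed first wins. It does not use ANTLR itself; RuleOrderDemo, rules, and winner are hypothetical names, and the regular expressions are simplified ASCII stand-ins for the real lexer rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleOrderDemo {
    // Builds an ordered rule table from (name, regex) pairs; insertion order = rule order.
    public static Map<String, Pattern> rules(String... nameThenRegex) {
        Map<String, Pattern> table = new LinkedHashMap<>();
        for (int i = 0; i < nameThenRegex.length; i += 2) {
            table.put(nameThenRegex[i], Pattern.compile(nameThenRegex[i + 1], Pattern.DOTALL));
        }
        return table;
    }

    // Returns the rule that wins at the start of the input:
    // longest match first, earliest-listed rule on a tie (strict '>' keeps the earlier rule).
    public static String winner(Map<String, Pattern> table, String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : table.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String operator = "<=|>=|!=|<|>|=";            // simplified AssignmentOrComparisonOperator
        String identifier = "[A-Za-z_][A-Za-z0-9]*";   // simplified, ASCII-only Identifier
        var catchAllLast = rules("Operator", operator, "Identifier", identifier, "RandomToken", ".");
        var catchAllFirst = rules("RandomToken", ".", "Operator", operator, "Identifier", identifier);

        System.out.println(winner(catchAllLast, "= text"));   // Operator
        System.out.println(winner(catchAllLast, "$1,000"));   // RandomToken
        System.out.println(winner(catchAllFirst, "= text"));  // RandomToken (tie, listed first)
        System.out.println(winner(catchAllFirst, "text"));    // Identifier (longer match)
    }
}
```

With the catch-all listed last, it only picks up characters that no other rule matches. Listed first, it takes every single-character token such as = away from the operator rule, while multi-character tokens still win on length.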
I'll assume you want more than one assignmentOrComparison in a source file, so:
statement: assignmentOrComparison EOF;
becomes
statements: assignmentOrComparison* EOF;
We are now hiding WS, and NEWLINE seems to be important as the statement terminator, so we change:
assignmentOrComparison
: memberAccess IWS* AssignmentOrComparisonOperator IWS* restOfLine
;
to
assignmentOrComparison
: memberAccess AssignmentOrComparisonOperator restOfLine NEWLINE
;
The main change is to the restOfLine rule:
restOfLine: memberAccess | StringLiteral;
becomes
restOfLine: memberAccess | QuotedString | unquotedString;
unquotedString
: (
memberAccess
| QuotedString
| AdditiveOperator
| RandomToken
)*
;
Now with the input file:
obj.prop1 = obj.prop2
obj.prop1 = "quoted text"
obj.prop1 = 'quoted text'
obj.prop1 = text
obj.prop1 = multiple words and punctuation $1,000?#-, etc
this recognizes the text in obj.prop1 = text as a memberAccess (parse tree image not reproduced here), so you'll have to use external information to determine if it's really an unquotedString.
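As a rough illustration of what that external information could look like, the sketch below assumes the host application knows the names of the root objects that can start a property access. That assumption, the class name RhsClassifier, and the knownRoots set are all hypothetical and not part of the vendor language or of ANTLR.

```java
import java.util.Set;

public class RhsClassifier {
    public enum Kind { MEMBER_ACCESS, UNQUOTED_STRING }

    // Hypothetical external knowledge: names that may start a property access.
    private final Set<String> knownRoots;

    public RhsClassifier(Set<String> knownRoots) {
        this.knownRoots = knownRoots;
    }

    // Takes the text of a node the parser reported as memberAccess and decides
    // whether it really is one, based only on its first dotted segment.
    public Kind classify(String memberAccessText) {
        int dot = memberAccessText.indexOf('.');
        String root = dot < 0 ? memberAccessText : memberAccessText.substring(0, dot);
        return knownRoots.contains(root) ? Kind.MEMBER_ACCESS : Kind.UNQUOTED_STRING;
    }

    public static void main(String[] args) {
        var classifier = new RhsClassifier(Set.of("obj"));
        System.out.println(classifier.classify("obj.prop2")); // MEMBER_ACCESS
        System.out.println(classifier.classify("text"));      // UNQUOTED_STRING
    }
}
```

A check like this could run in a listener or visitor when a restOfLine node is visited. Whether a root-name lookup is enough depends on what the vendor language actually allows on the right-hand side.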
And finally, obj.prop1 = multiple words and punctuation $1,000?#-, etc has the parse tree (image not reproduced here).
That's a lot of tokens (and ignored whitespace), but from the unquotedString parent node you can use the getSourceInterval() method to get the Interval, and then use the public String getText(Interval interval) method on your token stream to get the text of everything in your source code from the first token in the unquotedString to the last (and you can just ignore the child nodes). Careful: you want to get the text for the interval, not for the Context object itself. The Context object will just concatenate the text of each child node, while going to the TokenStream with the interval will include all the characters between the start and end of the source for that node.
The resulting Lexer and Parser source:
lexer grammar CondensedLexer;
AssignmentOrComparisonOperator
: ('<' | '<=' | '=' | '!=' | '>' | '>=')
;
AdditiveOperator: '+' | '-';
Identifier: IdentifierStartCharacter IdentifierPartCharacter*;
IntegerLiteral
: Sign? DecimalDigitCharacter DecimalDigitCharacter*
;
Dot: '.';
fragment DecimalDigit: '0' ..'9';
fragment Sign: '+' | '-';
fragment LetterCharacter
// Category Letter, all subcategories; category Number, subcategory letter.
: [\p{L}\p{Nl}]
;
fragment DecimalDigitCharacter
: [\p{Nd}]
; // Category Number, subcategory decimal digit.
fragment IdentifierStartCharacter
: LetterCharacter
| UnderscoreCharacter
;
fragment IdentifierPartCharacter
: LetterCharacter
| DecimalDigitCharacter
;
fragment UnderscoreCharacter
: '_' // underscore
| '\\u005' [fF] // Unicode escape sequence for underscore
;
WS: [ \t] -> channel(HIDDEN); // Insignificant white space
NEWLINE: '\r'? '\n';
fragment DoubleQuote: '"';
fragment SingleQuote: '\'';
QuotedString: DoubleQuotedString | SingleQuotedString;
fragment DoubleQuotedString
: DoubleQuote RegularStringLiteralCharacter* DoubleQuote
;
fragment SingleQuotedString
: SingleQuote RegularStringLiteralCharacter* SingleQuote
;
fragment RegularStringLiteralCharacter
: SingleRegularStringLiteralCharacter
| SimpleEscapeSequence
;
fragment SingleRegularStringLiteralCharacter
// anything but ", \, and NewLineCharacter
: ~["\\\u000D\u000A\u0085\u2028\u2029]
;
fragment SimpleEscapeSequence
: '\\\''
| '\\"'
| '\\\\'
| '\\0'
| '\\a'
| '\\b'
| '\\f'
| '\\n'
| '\\r'
| '\\t'
| '\\v'
;
RandomToken: .;
parser grammar CondensedParser;
options {
tokenVocab = CondensedLexer;
}
statements: assignmentOrComparison* EOF;
assignmentOrComparison
: memberAccess AssignmentOrComparisonOperator restOfLine NEWLINE
;
memberAccess
: memberAccess Dot Identifier
| Identifier
| IntegerLiteral // e.g. member.0
;
restOfLine: memberAccess | QuotedString | unquotedString;
unquotedString
: (
memberAccess
| QuotedString
| AdditiveOperator
| RandomToken
)*
;
Sample driver and listener:
import java.io.IOException;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTreeWalker;
public class Test {
    public static void main(String... args) throws IOException {
        new Test().run(CharStreams.fromFileName("./condensed.txt"));
    }

    public void run(CharStream charStream) {
        var lexer = new CondensedLexer(charStream);
        var tokenStream = new CommonTokenStream(lexer);
        var parser = new CondensedParser(tokenStream);
        var listener = new TestListener(tokenStream);
        var tree = parser.statements();
        ParseTreeWalker.DEFAULT.walk(listener, tree);
    }
}
import org.antlr.v4.runtime.CommonTokenStream;
public class TestListener extends CondensedParserBaseListener {
    CommonTokenStream tokenStream;

    TestListener(CommonTokenStream tstream) {
        tokenStream = tstream;
    }

    @Override
    public void enterUnquotedString(CondensedParser.UnquotedStringContext ctx) {
        // Concatenates the text of the child nodes only, so hidden-channel whitespace is lost.
        System.out.println(ctx.getText());
        // Asks the token stream for the node's whole token interval, hidden tokens included.
        var interval = ctx.getSourceInterval();
        System.out.println(tokenStream.getText(interval));
    }
}
output:
multiplewordsandpunctuation$1,000?#-,etc
multiple words and punctuation $1,000?#-, etc