I am learning how to use PEGKit, but am running into problem with creating a grammar for a script that parses lines, even when they are separated by multiple line break characters. I have reduced the problem to this grammar:
expr
@before {
PKTokenizer *t = self.tokenizer;
self.silentlyConsumesWhitespace = NO;
t.whitespaceState.reportsWhitespaceTokens = YES;
self.assembly.preservesWhitespaceTokens = YES;
}
= Word nl*;
nl = nl_char nl_char*;
nl_char = '\n'! | '\r'!;
This simple grammar to me should allow one word per line, with as many line breaks as necessary. But it only allows one word with an optional line break. Does anybody know what's wrong here? Thank you.
Creator of PEGKit here.
Try the following grammar instead (make sure you are using HEAD of master):
@before {
PKTokenizer *t = self.tokenizer;
[t.whitespaceState setWhitespaceChars:NO from:'\\n' to:'\\n'];
[t.whitespaceState setWhitespaceChars:NO from:'\\r' to:'\\r'];
[t setTokenizerState:t.symbolState from:'\\n' to:'\\n'];
[t setTokenizerState:t.symbolState from:'\\r' to:'\\r'];
}
lines = line+;
line = ~eol* eol+; // note the `~` Not unary operator. this means "zero or more NON eol tokens, followed by one or more eol token"
eol = '\n'! | '\r'!;
Note that here, I am tweaking the tokenizer to recogognize newlines and carriage returns as Symbol
s rather than whitespace. That makes it easier to match and discard them (they are discarded by the !
operator).
For another approach to the same problem using the builtin S
whitespace rule, see here.