java parsing jparsec

How do I build a parser out of tokenizers?


I am using JParsec to parse strings like:

[1,2, 3]
[ 3, 4]
[3   ,4,56, 7 ]
[]

I have implemented a few classes (inheriting from my Token interface) to represent the tokens:

final class OpenListToken
final class CommaToken
final class CloseListToken
final class NumberToken // Has a public final property "value" that contains the int

I have also implemented tokenizers for each:

static final Parser<OpenListToken> openListTokenParser
static final Parser<CommaToken> commaTokenParser
static final Parser<CloseListToken> closeListTokenParser
static final Parser<NumberToken> numberTokenParser

These all work at a character level. For example:

final NumberToken t = numberTokenParser.parse("123");
// t.value == 123

final OpenListToken u = openListTokenParser.parse("[");
// Succeeds

Now I would like to combine them into a parser for ListExpression, a class that represents a list of numbers. I have tried something like:

openListTokenParser
    .next(numberTokenParser.sepBy(commaTokenParser))
    .followedBy(closeListTokenParser)

This works for strings like [1,2,3] but obviously not for strings like [ 1, 2 ].

Is there an operator that takes some parsers and puts whitespace* between them?

Or is it possible to make my ListExpression parser work on a stream of my Token interface instances instead of characters?


Solution

  • You can build a tokenizer directly using the functions of the Terminals class. In your case, it would look like this:

    First, define the set of terminals (operators, keywords, words, and so on):

    Terminals terminals = Terminals.operators("[", "]", ",");
    

    Each token is then produced either by the terminals tokenizer or by the IntegerLiteral tokenizer:

    Parser<?> tokens = Parsers.or(terminals.tokenizer(), IntegerLiteral.TOKENIZER);
    

    The final parser is built from a syntactic parser for integers (which matches tokens tagged as INTEGER), separated by the comma token and enclosed between the bracket tokens. Whitespace between tokens is ignored (this is the second argument to from):

    Parser<?> parser = IntegerLiteral.PARSER.sepBy(terminals.token(",")).between(terminals.token("["), terminals.token("]"))
      .from(tokens, Scanners.WHITESPACES.many().cast());
    

    Et voilà:

    System.out.println(parser.parse( "[1,2,3]"));
    System.out.println(parser.parse( "[ 1, 2 , 3 ]   "));
    System.out.println(parser.parse( "   [1,2,3   ]"));
    System.out.println(parser.parse( "[1, 2   ,    3]"));
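For intuition about the two phases that from(tokens, delimiter) combines, here is a minimal library-free sketch in plain Java. It is not how jparsec is implemented, and the names ListSketch, tokenize, parse, expect and isNumber are hypothetical, introduced only for this illustration. Phase 1 turns characters into tokens and drops the whitespace between them; phase 2 checks the token sequence against the grammar '[' (number (',' number)*)? ']'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; jparsec does this work for you via from(...).
final class ListSketch {

    // Phase 1: split the input into tokens ("[", "]", "," or a run of digits),
    // skipping any whitespace that appears between tokens.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '[' || c == ']' || c == ',') {
                tokens.add(String.valueOf(c));
                i++;
            } else if (c >= '0' && c <= '9') {
                int start = i;
                while (i < input.length() && input.charAt(i) >= '0' && input.charAt(i) <= '9') {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at index " + i);
            }
        }
        return tokens;
    }

    // Phase 2: parse the token list as '[' (number (',' number)*)? ']'.
    static List<Integer> parse(String input) {
        List<String> tokens = tokenize(input);
        List<Integer> values = new ArrayList<>();
        int pos = expect(tokens, 0, "[");
        if (pos < tokens.size() && isNumber(tokens.get(pos))) {
            values.add(Integer.parseInt(tokens.get(pos++)));
            while (pos < tokens.size() && tokens.get(pos).equals(",")) {
                pos++; // consume the comma; a number must follow
                if (pos >= tokens.size() || !isNumber(tokens.get(pos))) {
                    throw new IllegalArgumentException("Expected a number after ','");
                }
                values.add(Integer.parseInt(tokens.get(pos++)));
            }
        }
        pos = expect(tokens, pos, "]");
        if (pos != tokens.size()) {
            throw new IllegalArgumentException("Unexpected token after ']': " + tokens.get(pos));
        }
        return values;
    }

    // Checks that the token at pos equals the expected one and returns the next position.
    private static int expect(List<String> tokens, int pos, String expected) {
        if (pos >= tokens.size() || !tokens.get(pos).equals(expected)) {
            throw new IllegalArgumentException("Expected '" + expected + "' at token " + pos);
        }
        return pos + 1;
    }

    // A token is a number if it starts with a digit (the tokenizer only emits digit runs).
    private static boolean isNumber(String token) {
        char c = token.charAt(0);
        return c >= '0' && c <= '9';
    }

    public static void main(String[] args) {
        System.out.println(parse("[ 1, 2 , 3 ]   "));
        System.out.println(parse("[]"));
    }
}
```

Because whitespace is discarded in phase 1, the grammar in phase 2 never has to mention it, which is the same separation of concerns the jparsec version achieves.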