Tags: java, parsing, tokenize, lexer, jparsec

Why this simple jparsec lexer fails?


I want to write a simple lexer that recognizes letter-only words and numbers, ignoring whitespace.

I wrote the following code using jparsec v3.0:

final Parser<String> words = Patterns.isChar(CharPredicates.IS_ALPHA).many1().toScanner("word").source();
final Parser<String> nums = Patterns.isChar(CharPredicates.IS_DIGIT).many1().toScanner("num").source();
final Parser<Tokens.Fragment> tokenizer = Parsers.or(
        words.map(it -> Tokens.fragment(it, "WORD")),
        nums.map(it -> Tokens.fragment(it, "NUM")));
final Parser<List<Token>> lexer = tokenizer.lexer(Scanners.WHITESPACES);

But the following test fails with the exception `org.jparsec.error.ParserException: line 1, column 7: EOF expected, 1 encountered`. Parsing the string "abc cd 123" (with a space before the digits) succeeds instead.

final List<Token> got = lexer.parse("abc cd123");
final List<Token> expected = Arrays.asList(
        new Token(0, 3, Tokens.fragment("abc", "WORD")),
        new Token(4, 2, Tokens.fragment("cd", "WORD")),
        new Token(6, 3, Tokens.fragment("123", "NUM")));
assertEquals(expected, got);

What am I doing wrong?


Solution

  • The problem is that `lexer(delim)` expects the delimiter between consecutive tokens, and `Scanners.WHITESPACES` must consume at least one whitespace character. In "abc cd123" there is no whitespace between "cd" and "123", so the lexer stops after "cd" and reports that it expected EOF at the '1' (column 7). The fix is simply to make the delimiter optional:

    tokenizer.lexer(Scanners.WHITESPACES.optional(null))
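For reference, here is a complete, runnable sketch of the fixed lexer (assuming jparsec 3.0 on the classpath; the class name `LexerDemo` and the helper `tokenize` are just for illustration, not part of the library):

```java
import java.util.List;

import org.jparsec.Parser;
import org.jparsec.Parsers;
import org.jparsec.Scanners;
import org.jparsec.Token;
import org.jparsec.Tokens;
import org.jparsec.pattern.CharPredicates;
import org.jparsec.pattern.Patterns;

public class LexerDemo {

    /** Tokenizes letter-only words and digit runs, skipping optional whitespace. */
    static List<Token> tokenize(String input) {
        Parser<String> words =
                Patterns.isChar(CharPredicates.IS_ALPHA).many1().toScanner("word").source();
        Parser<String> nums =
                Patterns.isChar(CharPredicates.IS_DIGIT).many1().toScanner("num").source();
        Parser<Tokens.Fragment> tokenizer = Parsers.or(
                words.map(it -> Tokens.fragment(it, "WORD")),
                nums.map(it -> Tokens.fragment(it, "NUM")));
        // The delimiter is optional, so adjacent tokens such as "cd" followed
        // immediately by "123" no longer require whitespace between them.
        Parser<List<Token>> lexer = tokenizer.lexer(Scanners.WHITESPACES.optional(null));
        return lexer.parse(input);
    }

    public static void main(String[] args) {
        for (Token t : tokenize("abc cd123")) {
            Tokens.Fragment f = (Tokens.Fragment) t.value();
            System.out.println(f.tag() + " \"" + f.text() + "\" at " + t.index());
        }
    }
}
```

With the optional delimiter, `tokenize("abc cd123")` produces the three expected tokens: WORD "abc", WORD "cd", and NUM "123".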