Error when generating a grammar for chess PGN files


I wrote this ANTLR4 grammar to parse PGN inside my Java program, but I can't resolve the error it produces:

grammar Pgn;

file:       game (NEWLINE+ game)*;
game:       (tag+ NEWLINE+)? notation;

tag:        [TAG_TYPE "TAG_VALUE"];
notation: move+ END_RESULT?;
move:   MOVE_NUMBER\. MOVE_DESC MOVE_DESC   #CompleteMove
        |   MOVE_NUMBER\. MOVE_DESC             #OnlyWhiteMove
        |   MOVE_NUMBER\.\.\. MOVE_DESC         #OnlyBlackMove
        ;

END_RESULT: '1-0'
            | '0-1'
            | '1/2-1/2'
            ;

TAG_TYPE:   LETTER+;
TAG_VALUE:  .*;

MOVE_NUMBER: DIGIT+;
MOVE_DESC: .*;  

NEWLINE:    \r? \n;
SPACES:     [ \t]+ -> skip;

fragment LETTER: [a-zA-Z];
fragment DIGIT: [0-9];

And this is the error output:

$ antlr4 Pgn.g4 
error(50): Pgn.g4:6:6: syntax error: 'TAG_TYPE "TAG_VALUE"' came as a complete surprise to me while matching alternative

I think the error comes from the fact that "[", "]" and '"' can't be used freely, in either parser or lexer rules.

Any help or advice is welcome.


Solution

  • The PGN specification, http://www.thechessdrum.net/PGN_Reference.txt, includes a formal definition of the PGN format:

    18: Formal syntax
    
    <PGN-database> ::= <PGN-game> <PGN-database>
                       <empty>
    
    <PGN-game> ::= <tag-section> <movetext-section>
    
    <tag-section> ::= <tag-pair> <tag-section>
                      <empty>
    
    <tag-pair> ::= [ <tag-name> <tag-value> ]
    
    <tag-name> ::= <identifier>
    
    <tag-value> ::= <string>
    
    <movetext-section> ::= <element-sequence> <game-termination>
    
    <element-sequence> ::= <element> <element-sequence>
                           <recursive-variation> <element-sequence>
                           <empty>
    
    <element> ::= <move-number-indication>
                  <SAN-move>
                  <numeric-annotation-glyph>
    
    <recursive-variation> ::= ( <element-sequence> )
    
    <game-termination> ::= 1-0
                           0-1
                           1/2-1/2
                           *
    <empty> ::=
    

    I highly recommend making your ANTLR grammar resemble that definition as closely as possible. I made a small ANTLR 4 project on GitHub which you can try out: https://github.com/bkiers/PGN-parser

    The grammar (without comments):

    grammar PGN;
    
    parse
     : pgn_database EOF
     ;
    
    pgn_database
     : pgn_game*
     ;
    
    pgn_game
     : tag_section movetext_section
     ;
    
    tag_section
     : tag_pair*
     ;
    
    tag_pair
     : LEFT_BRACKET tag_name tag_value RIGHT_BRACKET
     ;
    
    tag_name
     : SYMBOL
     ;
    
    tag_value
     : STRING
     ;
    
    movetext_section
     : element_sequence game_termination
     ;
    
    element_sequence
     : (element | recursive_variation)*
     ;
    
    element
     : move_number_indication
     | san_move
     | NUMERIC_ANNOTATION_GLYPH
     ;
    
    move_number_indication
     : INTEGER PERIOD?
     ;
    
    san_move
     : SYMBOL
     ;
    
    recursive_variation
     : LEFT_PARENTHESIS element_sequence RIGHT_PARENTHESIS
     ;
    
    game_termination
     : WHITE_WINS
     | BLACK_WINS
     | DRAWN_GAME
     | ASTERISK
     ;
    
    WHITE_WINS
     : '1-0'
     ;
    
    BLACK_WINS
     : '0-1'
     ;
    
    DRAWN_GAME
     : '1/2-1/2'
     ;
    
    REST_OF_LINE_COMMENT
     : ';' ~[\r\n]* -> skip
     ;
    
    BRACE_COMMENT
     : '{' ~'}'* '}' -> skip
     ;
    
    ESCAPE
     : {getCharPositionInLine() == 0}? '%' ~[\r\n]* -> skip
     ;
    
    SPACES
     : [ \t\r\n]+ -> skip
     ;
    
    STRING
     : '"' ('\\\\' | '\\"' | ~[\\"])* '"'
     ;
    
    INTEGER
     : [0-9]+
     ;
    
    PERIOD
     : '.'
     ;
    
    ASTERISK
     : '*'
     ;
    
    LEFT_BRACKET
     : '['
     ;
    
    RIGHT_BRACKET
     : ']'
     ;
    
    LEFT_PARENTHESIS
     : '('
     ;
    
    RIGHT_PARENTHESIS
     : ')'
     ;
    
    LEFT_ANGLE_BRACKET
     : '<'
     ;
    
    RIGHT_ANGLE_BRACKET
     : '>'
     ;
    
    NUMERIC_ANNOTATION_GLYPH
     : '$' [0-9]+
     ;
    
    SYMBOL
     : [a-zA-Z0-9] [a-zA-Z0-9_+#=:-]*
     ;
    
    SUFFIX_ANNOTATION
     : [?!] [?!]?
     ;
    
    UNEXPECTED_CHAR
     : .
     ;
    

    For a version with comments, see: https://github.com/bkiers/PGN-parser/blob/master/src/main/antlr4/nl/bigo/pp/PGN.g4
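
    As for the error itself: ANTLR 4 only accepts literal text inside single quotes, and bracketed character sets are only valid in lexer rules. The bare [ ... ], the double quotes, and the backslash-escaped periods in the question's parser rules are therefore what the tool rejects. A lexer rule consisting only of .* is also a problem, since it can match the empty string and tends to swallow the rest of the input. Below is a minimal, untested sketch of how the original rules might be repaired. The STRING rule is borrowed from the grammar above; the MOVE_DESC character set, the WS rule, and the choice to skip newlines rather than match NEWLINE are assumptions made for illustration.

    ```antlr
    grammar Pgn;

    // Parser rules: every literal is single-quoted.
    file     : game+ EOF ;
    game     : tag* notation ;

    tag      : '[' TAG_TYPE STRING ']' ;
    notation : move+ END_RESULT? ;
    move     : MOVE_NUMBER '.' MOVE_DESC MOVE_DESC   # CompleteMove
             | MOVE_NUMBER '.' MOVE_DESC             # OnlyWhiteMove
             | MOVE_NUMBER '...' MOVE_DESC           # OnlyBlackMove
             ;

    // Lexer rules: no bare ".*" tokens.
    END_RESULT  : '1-0' | '0-1' | '1/2-1/2' ;

    // Defined before MOVE_DESC so that letters-only names become TAG_TYPE.
    TAG_TYPE    : LETTER+ ;
    MOVE_NUMBER : DIGIT+ ;

    // Assumption: a move starts with a letter and contains SAN characters only.
    MOVE_DESC   : LETTER [a-zA-Z0-9+#=-]* ;

    // Same STRING rule as in the spec-based grammar above.
    STRING      : '"' ('\\\\' | '\\"' | ~[\\"])* '"' ;

    // Assumption: all whitespace, newlines included, is skipped.
    WS          : [ \t\r\n]+ -> skip ;

    fragment LETTER : [a-zA-Z] ;
    fragment DIGIT  : [0-9] ;
    ```

    This keeps the shape of the original grammar. A tag name made only of letters has the same length as a MOVE_DESC match, and ANTLR resolves such ties in favor of the rule defined first, which is why TAG_TYPE precedes MOVE_DESC here. The spec-based grammar above avoids that overlap by using a single SYMBOL token for both tag names and moves.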