ANTLR4 parsing a Wiktionary article fails weirdly


I'm trying to parse MediaWiki markup, specifically the flavor used in English Wiktionary articles.
Since it isn't a programming language, its handling of whitespace and newlines is rather odd, and every step feels like trial and (lots of) error.

Here's the repo: https://github.com/WorDB/wikitext-parser

The test input file is the pie article: pie.txt
(https://en.wiktionary.org/wiki/pie)

Note: I'm parsing the whole XML dump of Wiktionary, so I'd prefer a solution that parses with ANTLR rather than suggestions to use an online API.

wikitext.g4

grammar wikitext;

/**
 Grammar
 */

page: EOL? ((wikitem | bullet_line) EOL? )+ EOF;

wikitem:
      wikitem wikitem
    | title 
    | template
    | link
    | text
    ;

title: title2 | title3 | title4 | title5;
title5: '=====' text '=====';
title4: '====' text '====';
title3: '===' text '===';
title2: '==' text '==';

template: '{{' parameter ('|' parameter)* '}}';
link: '[[' parameter ('|' parameter)* ']]';

parameter: wikitem?; // parameter can be empty, e.g. {{a|}}

bullet: ('*'|'#'|'#:'|'#*');
bullet_line: WS? EOL WS? bullet WS? wikitem;

text: (CHAR | WS)+;

/**
 Lexicon
 */
EOL: [\f\r\n]+;
CHAR: ~[ \t\f\r\n];
WS: [ \t]+;  

Error:

> cd ./java && grun wikitext page -gui ../data/pie.txt

line 190:137 no viable alternative at input 'rom {{inh|en|enm|pye}}, from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'om {{inh|en|enm|pye}}, from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'm {{inh|en|enm|pye}}, from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{inh|en|enm|pye}}, from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'rom {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'om {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'm {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'rom {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'om {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'm {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'eminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'minine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'inine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'nine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'ine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'ne of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'e of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'f {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'rom {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'om {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'm {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{der|en|ine-pro|*'
line 190:137 extraneous input '*' expecting {'|', '}}'}
line 190:146 no viable alternative at input 's)peyk-|'
line 190:146 no viable alternative at input ')peyk-|'
line 190:146 no viable alternative at input 'peyk-|'
line 190:146 no viable alternative at input 'eyk-|'
line 190:146 no viable alternative at input 'yk-|'
line 190:146 no viable alternative at input 'k-|'
line 190:146 no viable alternative at input '-|'
line 190:146 mismatched input '|' expecting {<EOF>, '=====', '====', '===', '==', '{{', '[[', EOL, CHAR, WS}

Solution

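  • I have changed some rules: bullet_line is now an alternative of wikitem instead of page, so a '*' or '#' can also start an item inside a template or link parameter (as in the {{der|en|ine-pro|* from line 190 of your error output); bullet_line no longer requires a leading EOL; and the separate bullet rule is inlined as a labeled token set. Could you check whether this works for you?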

    grammar wikitext;
    
    /**
     Grammar
     */
    
    page: EOL? (wikitem EOL? )+ EOF;
    
    wikitem:
          wikitem wikitem
        | title
        | template
        | link
        | text
        | bullet_line
        ;
    
    title: title2 | title3 | title4 | title5;
    title5: '=====' text '=====';
    title4: '====' text '====';
    title3: '===' text '===';
    title2: '==' text '==';
    
    template: '{{' parameter ('|' parameter)* '}}';
    link: '[[' parameter ('|' parameter)* ']]';
    
    parameter: wikitem?; // parameter can be empty, e.g. {{a|}}
    
    bullet_line: WS? bullet=('*'|'#'|'#:'|'#*') WS? wikitem;
    
    text: (CHAR | WS)+;
    
    /**
     Lexicon
     */
    EOL: [\f\r\n]+;
    CHAR: ~[ \t\f\r\n];
    WS: [ \t]+;
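
For context on why the original grammar stops at the '*' on line 190, here is a rough sketch of what the lexer looks like once the quoted literals from the parser rules are written out as explicit tokens. In a combined grammar, ANTLR creates an implicit token for every such literal and gives it priority over the lexer rules that follow; the lexer then takes the longest match and, on a tie, the rule defined first. The token names below are my own placeholders (ANTLR names the implicit ones T__0, T__1, and so on), so treat this as an illustration rather than the exact generated lexer.

```antlr
// Sketch only: the lexer of wikitext.g4 with the parser literals spelled out.
// Token names are hypothetical stand-ins for ANTLR's implicit T__n tokens.
H5: '=====';
H4: '====';
H3: '===';
H2: '==';
TEMPLATE_OPEN: '{{';
PIPE: '|';
TEMPLATE_CLOSE: '}}';
LINK_OPEN: '[[';
LINK_CLOSE: ']]';
STAR: '*';
HASH: '#';
HASH_COLON: '#:';
HASH_STAR: '#*';

// The rules from the original grammar come after the literals.
EOL: [\f\r\n]+;
CHAR: ~[ \t\f\r\n];   // would also match '*', but STAR is defined earlier and wins the tie
WS: [ \t]+;
```

Because text is defined as (CHAR | WS)+, it can never contain a '*', '#', '|' or '==', since those characters never reach the parser as CHAR. In the original grammar a '*' was therefore only accepted at the start of a bullet_line after an EOL, which matches the "extraneous input '*'" message. The same restriction still applies to the other literals in the grammar above, so text containing them may need similar treatment. Running grun wikitext page -tokens ../data/pie.txt should print the token stream, which makes it easy to see which token type each character ends up with.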