
How can I parse a tag with a space in its value using ANTLR?


I have the following grammar:

meta 
    : '<' NAME '>' TEXT '</' NAME '>'
    | '<' NAME S* attribute* '>';

dl : '<' NAME '><' TEXT '>' dt* '</' NAME '><' TEXT '>';

dt : '<' NAME '><' NAME S* attribute* S* '>' TEXT '</' NAME '>';

attribute : attributeName '=' attributeValue;

attributeName : NAME;

attributeValue : VAL;

NAME : [A-Z0-9_-]+;

VAL : '"'.*?'"';

TEXT : [A-Za-z0-9:\/\.@\-;\s*]+;

S : [ \t\r\n]+ -> skip;

The input string is:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Abcd</TITLE>
<H1>Abcd</H1>
<DL><p>
    <DT><H3 ADD_DATE="1481473849" LAST_MODIFIED="1481473992" PERSONAL_XYZ_FOLDER="true">Foo bar</H3>
</DL><p>

I am getting the following error:

ParseError extraneous input 'bar' expecting '</'  clj-antlr.common/parse-error (common.clj:146)

The problem is that whitespace is skipped, so the space in "Foo bar" causes the error. But if I do not skip whitespace, I get other errors when parsing the META tag. (The S* is not needed when whitespace is skipped.)

ParseError extraneous input ' ' expecting {'>', NAME}
mismatched input '>' expecting '><'
mismatched input '<' expecting {<EOF>, COMMENT, S}  clj-antlr.common/parse-error (common.clj:146)

Here is the tokens file generated by ANTLR:

T__0=1
T__1=2
T__2=3
T__3=4
T__4=5
DTD=6
COMMENT=7
NAME=8
VAL=9
TEXT=10
S=11
'<'=1
'>'=2
'</'=3
'><'=4
'='=5

When I run grun with -tokens I get the following, and I don't see any errors in the reported tokens. They look consistent with the grammar I defined. How can I accept spaces in tag values?

$ grun MyGrammer r -tokens
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
[@0,0:0='<',<1>,1:0]
[@1,1:4='META',<8>,1:1]
[@2,5:5=' ',<11>,1:5]
[@3,6:15='HTTP-EQUIV',<8>,1:6]
[@4,16:16='=',<5>,1:16]
[@5,17:30='"Content-Type"',<9>,1:17]
[@6,31:31=' ',<11>,1:31]
[@7,32:38='CONTENT',<8>,1:32]
[@8,39:39='=',<5>,1:39]
[@9,40:65='"text/html; charset=UTF-8"',<9>,1:40]
[@10,66:66='>',<2>,1:66]
[@11,67:67='\n',<11>,1:67]
[@12,68:67='<EOF>',<-1>,2:0]
No method for rule r or it has arguments

Thanks.


Solution

  • With a space between Foo and bar, the lexer produces two tokens (both of type TEXT), but the grammar allows only a single TEXT token at that position. To fix this, allow one or more TEXT tokens in sequence with the plus operator:

    dt : '<' NAME '><' NAME S* attribute* S* '>' TEXT+ '</' NAME '>';
    

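    Also note that you may run into problems because NAME and TEXT overlap: both can match input made up of [A-Z0-9]+. When two lexer rules match the same length of input, ANTLR picks the rule defined first, so such input becomes a NAME token rather than a TEXT token.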
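
One possible way to handle that overlap, sketched under the assumption that whitespace is still skipped by S: let the element body be any mix of TEXT and NAME tokens. The rule name content is made up for this illustration and does not appear in the original grammar.

```antlr
// Hypothetical helper rule: a word in the element body may come out of the
// lexer as either TEXT or NAME, so accept one or more of either.
content : (TEXT | NAME)+ ;

// dt rewritten to use it. The S* references are dropped because
// skipped tokens never reach the parser.
dt : '<' NAME '><' NAME attribute* '>' content '</' NAME '>' ;
```

With this, both Foo bar and an all-uppercase body such as FOO BAR should parse. Note that skipped whitespace is not part of the parse tree, so the original spacing has to be recovered from the input text, for example by using the start and stop offsets of the first and last tokens of content.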