Tags: bash, parsing, compiler-construction, antlr, lexer

Why is a comma "," matched by a [...] character class in an ANTLR lexer?


I am writing a grammar for bash scripts and have a problem tokenising the "," symbol. The following grammar tokenises it as <BLOB>, while I expect it to be tokenised as <OTHER>.

grammar newgram;

code                : KEY (BLOB)+   (EOF | '\n')+;

KEY                 : 'wget';

BLOB                : [a-zA-Z0-9@!$^%*&+-.]+?;

OTHER               : .;

However, if I change BLOB to [a-zA-Z0-9@!$^%*&+.-]+?; (with the - moved to the end of the class), the comma is tokenised as <OTHER>.

I cannot understand why this happens.

With the first version, the characters : and / are also tokenised as <OTHER>, so I do not see why , is marked <BLOB>.

The input I am tokenising:

wget -o --quiet https,://www.google.com

The output I receive with the grammar above:

[@0,0:3='wget',<'wget'>,1:0]
[@1,4:4=' ',<OTHER>,1:4]
[@2,5:5='-',<BLOB>,1:5]
[@3,6:6='o',<BLOB>,1:6]
[@4,7:7=' ',<OTHER>,1:7]
[@5,8:8='-',<BLOB>,1:8]
[@6,9:9='-',<BLOB>,1:9]
[@7,10:10='q',<BLOB>,1:10]
[@8,11:11='u',<BLOB>,1:11]
[@9,12:12='i',<BLOB>,1:12]
[@10,13:13='e',<BLOB>,1:13]
[@11,14:14='t',<BLOB>,1:14]
[@12,15:15=' ',<OTHER>,1:15]
[@13,16:16='h',<BLOB>,1:16]
[@14,17:17='t',<BLOB>,1:17]
[@15,18:18='t',<BLOB>,1:18]
[@16,19:19='p',<BLOB>,1:19]
[@17,20:20='s',<BLOB>,1:20]
[@18,21:21=',',<BLOB>,1:21]
[@19,22:22=':',<OTHER>,1:22]
[@20,23:23='/',<OTHER>,1:23]
[@21,24:24='/',<OTHER>,1:24]
[@22,25:25='w',<BLOB>,1:25]
[@23,26:26='w',<BLOB>,1:26]
[@24,27:27='w',<BLOB>,1:27]
[@25,28:28='.',<BLOB>,1:28]
[@26,29:29='g',<BLOB>,1:29]
[@27,30:30='o',<BLOB>,1:30]
[@28,31:31='o',<BLOB>,1:31]
[@29,32:32='g',<BLOB>,1:32]
[@30,33:33='l',<BLOB>,1:33]
[@31,34:34='e',<BLOB>,1:34]
[@32,35:35='.',<BLOB>,1:35]
[@33,36:36='c',<BLOB>,1:36]
[@34,37:37='o',<BLOB>,1:37]
[@35,38:38='m',<BLOB>,1:38]
[@36,39:39='\n',<'
'>,1:39]
[@37,40:39='<EOF>',<EOF>,2:0]
line 1:4 extraneous input ' ' expecting BLOB
line 1:7 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:15 extraneous input ' ' expecting {<EOF>, '
', BLOB}
line 1:22 extraneous input ':' expecting {<EOF>, '
', BLOB}

Solution

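  • As already mentioned in a comment, the - in +-. inside your character class is interpreted as a range operator, so +-. means every character from + (0x2B) to . (0x2E). The , (0x2C) lies inside that range, which is why it is tokenised as BLOB. Escape the - like this: [a-zA-Z0-9@!$^%*&+\-.]+? Your second version behaves as you expect for the same reason: with the - as the last character of the class, it is not read as a range.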

    Also, a trailing [ ... ]+? at the end of a lexer rule will always match a single character, because +? is non-greedy and nothing follows it that would force a longer match. So [a-zA-Z0-9@!$^%*&+\-.]+? can just as well be written as [a-zA-Z0-9@!$^%*&+\-.]
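
The range behaviour described above can be checked outside ANTLR. The following is a minimal sketch in Python, on the assumption that Python's re module treats - inside a character class the same way for this case; it is a stand-in for illustration, not the ANTLR lexer itself. The helper chars_in_range is a hypothetical name introduced only for this example.

```python
import re

def chars_in_range(start: str, end: str) -> str:
    """Expand start-end the way a character-class range does (inclusive)."""
    return "".join(chr(cp) for cp in range(ord(start), ord(end) + 1))

# '+' is 0x2B, ',' is 0x2C, '-' is 0x2D and '.' is 0x2E,
# so the range +-. silently covers the comma as well.
covered = chars_in_range("+", ".")
print(covered)  # +,-.

# Python's re module is used here only as a stand-in for the ANTLR lexer.
unescaped = re.compile(r"[a-zA-Z0-9@!$^%*&+-.]")   # '-' forms the range +-.
escaped = re.compile(r"[a-zA-Z0-9@!$^%*&+\-.]")    # '-' is a literal character
trailing = re.compile(r"[a-zA-Z0-9@!$^%*&+.-]")    # '-' is last, so also literal

for name, pattern in [("unescaped", unescaped), ("escaped", escaped), ("trailing", trailing)]:
    print(name, "matches ',':", pattern.fullmatch(",") is not None)

# A lazy quantifier with nothing after it stops after one character.
lazy = re.match(r"[a-z]+?", "https").group()
print(lazy)  # h
```

Only the unescaped class accepts the comma, which matches the token dump in the question. The last two lines illustrate why the +? adds nothing at the end of the rule.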