
Parse sentences with different word types


I'm looking for a grammar that analyzes two types of sentences, where a sentence means words separated by whitespace:

  1. ID1: sentences whose words do not begin with a digit
  2. ID2: sentences whose words do not begin with a digit, plus plain numbers

Basically, the structure of the grammar should look like this:

ID1 separator ID2  

ID1: Word can contain number like Var1234 but not start with a number  

ID2: Same as above but 1234 is allowed  

separator: e.g. '='

@Bart
I just tried to add two tokens, '_' and '"', as a lexer rule Special for later use in the lexer rule Word. Even though I haven't used Special anywhere in the following grammar yet, ANTLRWorks 1.4.2 reports the following error:
The following token definitions can never be matched because prior tokens match the same input: Special
But when I add fragment before Special, I don't get that error. Why?

grammar Sentence1b1;

tokens
{
  TCUnderscore  = '_' ;
  TCQuote       = '"' ;
}

assignment
  :  id1 '=' id2
  ;

id1
  :  Word+
  ;

id2
  :  ( Word | Int )+
  ;

Int
  :  Digit+
  ;

// A word must start with a letter
Word
  :  ( 'a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit )*
  ;

Special
  : ( TCUnderscore | TCQuote )
  ;

Space
  :  ( ' ' | '\t' | '\r' | '\n' ) { $channel = HIDDEN; }
  ;

fragment Digit
  :  '0'..'9'
  ;

The lexer rule Special is then meant to be used in the lexer rule Word:

Word
  :  ( 'a'..'z' | 'A'..'Z' | Special ) ('a'..'z' | 'A'..'Z' | Special | Digit )*
  ;

Solution

  • I'd go for something like this:

    grammar Sentence;
    
    assignment
      :  id1 '=' id2
      ;
    
    id1
      :  Word+
      ;
    
    id2
      :  (Word | Int)+
      ;
    
    Int
      :  Digit+
      ;
    
    // A word must start with a letter
    Word
      :  ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit)*
      ;
    
    Space
      :  (' ' | '\t' | '\r' | '\n') {skip();}
      ;
    
    fragment Digit
      :  '0'..'9'
      ;
    

    which will parse the input:

    Word can contain number like Var1234 but not start with a number = Same as above but 1234 is allowed

    as follows:

    [Image: parse tree of the input]
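Since the parse tree is only available as an image, here is a rough sketch of how that input splits into tokens and then into id1 and id2. It is plain Python rather than ANTLR, the names `tokenize` and `parse_assignment` are made up for illustration, and the regular expressions only approximate the Int, Word and Space rules of the grammar above.

```python
import re

# Approximate stand-ins for the lexer rules Int, Word and Space, plus the '=' literal.
# These patterns are an assumption for illustration, not generated by ANTLR.
TOKEN_RE = re.compile(
    r"(?P<Int>[0-9]+)"
    r"|(?P<Word>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<Eq>=)"
    r"|(?P<Space>[ \t\r\n])"
)

def tokenize(text):
    """Split text into (kind, text) pairs, dropping Space like skip() does."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"no token matches at position {pos}: {text[pos]!r}")
        if m.lastgroup != "Space":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse_assignment(text):
    """Mimic 'assignment : id1 = id2' with id1 : Word+ and id2 : (Word | Int)+."""
    tokens = tokenize(text)
    kinds = [kind for kind, _ in tokens]
    if kinds.count("Eq") != 1:
        raise ValueError("expected exactly one '='")
    split = kinds.index("Eq")
    id1, id2 = tokens[:split], tokens[split + 1:]
    if not id1 or any(kind != "Word" for kind, _ in id1):
        raise ValueError("id1 must be one or more Word tokens")
    if not id2:
        raise ValueError("id2 must be one or more Word or Int tokens")
    return id1, id2

if __name__ == "__main__":
    sentence = ("Word can contain number like Var1234 but not start with a number"
                " = Same as above but 1234 is allowed")
    left, right = parse_assignment(sentence)
    print(left)
    print(right)
```

In this sketch `Var1234` comes out as a single Word token, while the bare `1234` on the right-hand side becomes an Int token, which is only accepted in id2.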

    EDIT

    To keep the lexer rules nicely packed together, I'd keep them all at the bottom of the grammar instead of partly in the tokens { ... } block, which I only use for defining "imaginary tokens" (used in AST creation). Written as ordinary lexer rules, your definitions would look like this:

    // wrong!
    Special      : (TCUnderscore | TCQuote);
    TCUnderscore : '_';
    TCQuote      : '"';
    

    Now, with the rules above, TCUnderscore and TCQuote can never become tokens, because whenever the lexer encounters a _ or a ", it creates a Special token. Or, with the order reversed:

    // wrong!
    TCUnderscore : '_';
    TCQuote      : '"';
    Special      : (TCUnderscore | TCQuote);
    

    the Special token can never be created because the lexer would first create TCUnderscore and TCQuote tokens. Hence the error:

    The following token definitions can never be matched because prior tokens match the same input: ...
    

    If you make TCUnderscore and TCQuote fragment rules, you don't have that problem, because fragment rules only "serve" other lexer rules. So this works:

    // good!
    Special               : (TCUnderscore | TCQuote);
    fragment TCUnderscore : '_';
    fragment TCQuote      : '"';
    

    Also, fragment rules can therefore never be "visible" in any of your parser rules (the lexer will never create a TCUnderscore or TCQuote token!):

    // wrong!
    parse : TCUnderscore;
    
    Special               : (TCUnderscore | TCQuote);
    fragment TCUnderscore : '_';
    fragment TCQuote      : '"';
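
The three rule orderings discussed above can be played through with a small Python model. This is only a sketch under a simplifying assumption: the lexer is approximated as "longest match wins, and on a tie the rule defined first wins", and fragment rules are never tried on their own. ANTLR itself decides this with a generated DFA, and the helper names `winning_rule` and `shadowed_rules` are hypothetical.

```python
import re

def winning_rule(rules, text):
    """Return the name of the token rule that claims text, or None.

    rules is a list of (name, pattern, is_fragment) in grammar order.
    """
    best_name, best_len = None, 0
    for name, pattern, is_fragment in rules:
        if is_fragment:
            continue  # fragment rules never produce tokens on their own
        m = re.match(pattern, text)
        # Strict '>' keeps the earlier rule when two rules match equally much.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name

def shadowed_rules(rules, samples):
    """List the non-fragment rules that never win for any of the sample inputs."""
    winners = {winning_rule(rules, s) for s in samples}
    return [name for name, _, frag in rules if not frag and name not in winners]

SAMPLES = ["_", '"']

# Special first: the two single-character rules can never match.
special_first = [
    ("Special", r'[_"]', False),
    ("TCUnderscore", r"_", False),
    ("TCQuote", r'"', False),
]

# Single-character rules first: Special can never match.
tc_first = [
    ("TCUnderscore", r"_", False),
    ("TCQuote", r'"', False),
    ("Special", r'[_"]', False),
]

# TCUnderscore and TCQuote as fragments: nothing is shadowed.
with_fragments = [
    ("Special", r'[_"]', False),
    ("TCUnderscore", r"_", True),
    ("TCQuote", r'"', True),
]

if __name__ == "__main__":
    for label, rules in [("special_first", special_first),
                         ("tc_first", tc_first),
                         ("with_fragments", with_fragments)]:
        print(label, shadowed_rules(rules, SAMPLES))
```

In this model the first two orderings each leave at least one rule that can never win, which corresponds to the "can never be matched" error, while the fragment version leaves none.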