
ANTLR Grammar to get a Sentence as Single Token


I am trying to parse a PlantUML sequence diagram using an ANTLR grammar.
I am able to generate an AST that meets my needs.
The only issue I am facing is extracting the activity name.

PlantUML SequenceDiagram Grammar :

grammar SequenceDiagram;

uml:
    '@startuml'
    ('autonumber')?
    ('hide footbox')?
    (NEWLINE | sequence_diagram)
    '@enduml'
    ;
    
sequence_diagram:
    (node | relationship | NEWLINE)*
    ;

node:
    ('actor' | 'entity') ident 'as' ident;
 
relationship:
    action 
    (NEWLINE* note)?
    ;

action:
    left=ident 
    arrow 
    right=ident 
    ':' 
    lable=ident
    ;

note:
    'note' direction ':' ident
    ;

ident:
    IDENT;
    
label:
    ident (ident)+ ~(NEWLINE);

direction:
    ('left'|'right');

arrow:
    ('->'|'-->'|'<-'|'<--');

IDENT : NONDIGIT ( DIGIT | NONDIGIT )*;

NEWLINE  :   [\r\n]+ -> skip ;

COMMENT :
    ('/' '/' .*? '\n' | '/*' .*? '*/') -> channel(HIDDEN)
    ;
WS  :   [ ]+ -> skip ; // toss out whitespace

fragment NONDIGIT : [_a-zA-Z];
fragment DIGIT :  [0-9];
fragment UNSIGNED_INTEGER : DIGIT+;

Sample SequenceDiagram Code :

@startuml
actor Alice as al
entity Bob as b

Alice -> Bob: Authentication_Request
Bob --> Alice: Authentication_Response
Alice -> Bob: Another_Authentication_Request
Alice <-- Bob: Another_Authentication_Response
note right: example_note

@enduml

Generated AST :

[image: generated parse tree for the sample above]

Note that the labels (Authentication_Request, Authentication_Response, etc.) are each a single word; that is my workaround.
I would like to write them space-separated instead: "Authentication Request", "Authentication Response", etc.

I cannot figure out how to get such a sentence as a single token.

Any help would be appreciated.

Edit 1 :

How do I extract the description part in the actor and usecase declarations? I need to extract Chef, "Food Critic", "Eat Food", ..., "Drink", ..., Test1 from the following:

package Professional {
  actor Chef as c
  actor "Food Critic" as fc
}
package Restaurant {
  usecase "Eat Food" as UC1
  usecase "Pay for Food" as UC2
  usecase "Drink" as UC3
  usecase "Review" as UC4
  usecase Test1
}

SOLUTION for the above edit:

node:
    ('actor' | 'usecase') (ident | description) 'as'? ident?;

description:
    DESCRIPTION;

DESCRIPTION: '"' ~["]* '"';
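
To illustrate what this rule is meant to pull out of each declaration, here is a minimal Python sketch. It is a regex approximation of the `node` rule, not the ANTLR parser itself; `NODE_RE` and `parse_node` are hypothetical names introduced only for this example. Unlike the grammar, which makes `'as'` and the alias independently optional, the regex assumes they always appear together.

```python
import re

# Regex approximation of the `node` rule above (an assumption, not ANTLR output):
#   ('actor' | 'usecase') (ident | description) 'as'? ident?
# Simplification: 'as' and the alias are treated as one optional unit.
NODE_RE = re.compile(
    r'^\s*(?P<kind>actor|usecase)\s+'
    r'(?:"(?P<description>[^"]*)"|(?P<ident>[_a-zA-Z][_a-zA-Z0-9]*))'
    r'(?:\s+as\s+(?P<alias>[_a-zA-Z][_a-zA-Z0-9]*))?\s*$'
)


def parse_node(line):
    """Return (kind, name, alias) for a declaration line, or None if it does not match."""
    m = NODE_RE.match(line)
    if m is None:
        return None
    # A DESCRIPTION token's text would include the surrounding quotes;
    # the `description` group here captures only what is between them.
    name = m.group("description")
    if name is None:
        name = m.group("ident")
    return m.group("kind"), name, m.group("alias")
```

With the sample input, `parse_node('actor "Food Critic" as fc')` gives `("actor", "Food Critic", "fc")` and `parse_node("usecase Test1")` gives `("usecase", "Test1", None)`. In a real ANTLR listener or visitor, the equivalent step would be stripping the first and last character from the DESCRIPTION token's text.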


Solution

  • Perhaps use the ':' and EOL as delimiters. Judging by the PlantUML site, this seems to be how labels are written, at least for sequence diagrams.

    You'd need to drop the ':' part of your action rule, because the LABEL token includes the colon, and strip the leading ':' from the token text when you use it. You could avoid the stripping with a lexer mode, but that seems like overkill.

    The PlantUML site includes this example:

    @startuml
    Alice -> Alice: This is a signal to self.\nIt also demonstrates\nmultiline \ntext
    @enduml
    

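    So you'll need to be fairly flexible about what you accept in the LABEL token; it's not just one or more IDENTs. I'm therefore using a rule that simply picks up everything from the ':' to the EOL. Be aware that, because the lexer prefers the longest match, this token will also swallow the ':' and text in your note rule, so that rule would likely need to use LABEL as well.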

    action:
        left=ident 
        arrow 
        right=ident 
        LABEL
        ;
    
    LABEL: ':' ~[\r\n]*;
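
Since the LABEL token text still begins with the colon, it needs a small clean-up step before use. The following is a minimal Python sketch of that step; `label_lines` is a hypothetical helper name, and it assumes you pass it the raw token text (for example, what `ctx.LABEL().getText()` would return in a generated listener). It also splits on the literal two-character `\n` sequence that PlantUML uses for multi-line messages, which is an extra convenience rather than part of the grammar.

```python
def label_lines(token_text):
    """Turn the raw text of a LABEL token into a list of display lines."""
    if not token_text.startswith(":"):
        raise ValueError("expected LABEL text to start with ':'")
    # Drop the leading ':' that the lexer rule includes, then trim spaces.
    body = token_text[1:].strip()
    # PlantUML marks a line break inside a message with the two
    # characters backslash + 'n', not with a real newline.
    return [part.strip() for part in body.split("\\n")]
```

For the question's sample, `label_lines(": Authentication Request")` returns `["Authentication Request"]`, so the label can contain spaces without any workaround.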