How to highlight QScintilla using ANTLR4?


I'm trying to learn ANTLR4 and I'm already having some issues with my first experiment.

The goal here is to learn how to use ANTLR to syntax highlight a QScintilla component. To practice a little bit I've decided I'd like to learn how to properly highlight *.ini files.

First things first: to run the MCVE you'll need to:

  • Download antlr4 and make sure it works (see the instructions on the main site)
  • Install the Python ANTLR runtime: pip install antlr4-python3-runtime
  • Generate the lexer/parser from ini.g4:

    grammar ini;
    
    start : section (option)*;
    section : '[' STRING ']';
    option : STRING '=' STRING;
    
    COMMENT : ';'  ~[\r\n]*;
    STRING  : [a-zA-Z0-9]+;
    WS      : [ \t\n\r]+;
    

by running antlr ini.g4 -Dlanguage=Python3 -o ini

  • Finally, save main.py:

    import textwrap
    
    from PyQt5.Qt import *
    from PyQt5.Qsci import QsciScintilla, QsciLexerCustom
    
    from antlr4 import *
    from ini.iniLexer import iniLexer
    from ini.iniParser import iniParser
    
    
    class QsciIniLexer(QsciLexerCustom):
    
        def __init__(self, parent=None):
            super().__init__(parent=parent)
    
            lst = [
                {'bold': False, 'foreground': '#f92472', 'italic': False},  # 0 - deeppink
                {'bold': False, 'foreground': '#e7db74', 'italic': False},  # 1 - khaki (yellowish)
                {'bold': False, 'foreground': '#74705d', 'italic': False},  # 2 - dimgray
                {'bold': False, 'foreground': '#f8f8f2', 'italic': False},  # 3 - whitesmoke
            ]
            style = {
                "T__0": lst[3],
                "T__1": lst[3],
                "T__2": lst[3],
                "COMMENT": lst[2],
                "STRING": lst[0],
                "WS": lst[3],
            }
    
            for token in iniLexer.ruleNames:
                token_style = style[token]
    
                foreground = token_style.get("foreground", None)
                background = token_style.get("background", None)
                bold = token_style.get("bold", None)
                italic = token_style.get("italic", None)
                underline = token_style.get("underline", None)
                index = getattr(iniLexer, token)
    
                if foreground:
                    self.setColor(QColor(foreground), index)
                if background:
                    self.setPaper(QColor(background), index)
    
        def defaultPaper(self, style):
            return QColor("#272822")
    
        def language(self):
            # self.lexer is a QsciLexer method, so read the name from the generated class
            return iniLexer.grammarFileName
    
        def styleText(self, start, end):
            view = self.editor()
            code = view.text()
            lexer = iniLexer(InputStream(code))
            stream = CommonTokenStream(lexer)
            parser = iniParser(stream)
    
            tree = parser.start()
            print('parsing'.center(80, '-'))
            print(tree.toStringTree(recog=parser))
    
            lexer.reset()
            self.startStyling(0)
            print('lexing'.center(80, '-'))
            while True:
                t = lexer.nextToken()
                if t.type == Token.EOF:
                    break
                print(lexer.ruleNames[t.type - 1], repr(t.text))
                # setStyling() expects a length in bytes, not characters
                self.setStyling(len(t.text.encode('utf-8')), t.type)
    
        def description(self, style_nr):
            return str(style_nr)
    
    
    if __name__ == '__main__':
        app = QApplication([])
        v = QsciScintilla()
        lexer = QsciIniLexer(v)
        v.setLexer(lexer)
        v.setText(textwrap.dedent("""\
            ; Comment outside
    
            [section s1]
            ; Comment inside
            a = 1
            b = 2
    
            [section s2]
            c = 3 ; Comment right side
            d = e
        """))
        v.show()
        app.exec_()
    

and run it. If everything went well, you should get this outcome:

[screenshot: showcase]

Here are my questions:

  • As you can see, the outcome of the demo is far from usable, and the behaviour is really distracting. I'd like it to behave like the highlighting in any IDE, but I don't know how to achieve that. How would you modify the snippet to get that behaviour?
  • Right now I'm trying to mimic the highlighting in the snapshot below:

[screenshot: target highlighting]

You can see in that screenshot that variable assignments are highlighted differently (variable = deeppink, value = yellowish), but I don't know how to achieve that. I've tried using this slightly modified grammar:

    grammar ini;
    
    start : section (option)*;
    section : '[' STRING ']';
    option : VARIABLE '=' VALUE;
    
    COMMENT : ';'  ~[\r\n]*;
    VARIABLE  : [a-zA-Z0-9]+;
    VALUE  : [a-zA-Z0-9]+;
    WS      : [ \t\n\r]+;

and then changing the styles to:

    style = {
        "T__0": lst[3],
        "T__1": lst[3],
        "T__2": lst[3],
        "COMMENT": lst[2],
        "VARIABLE": lst[0],
        "VALUE": lst[1],
        "WS": lst[3],
    }

But if you look at the lexing output, you'll see there is no distinction between VARIABLE and VALUE: both rules match exactly the same input, and ANTLR resolves the tie in favour of the rule defined first, so every such token comes out as VARIABLE. So my question is: how would you modify the grammar/snippet to achieve that visual appearance?


Solution

  • The problem is that the lexer needs to be context sensitive: everything on the left-hand side of the = needs to be a variable, and everything to the right of it a value. You can do this with ANTLR's lexical modes. You start off by classifying successive non-space characters as a variable, and when you encounter a =, you move into a value mode. Inside the value mode, you pop back out whenever you encounter a line break.

    Note that lexical modes only work in a lexer grammar, not in the combined grammar you have now. Also, for syntax highlighting you probably only need the lexer.
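
To make the mode idea concrete, here is a minimal pure-Python sketch of a mode-stack tokenizer. It is not ANTLR and not generated code: the `RULES` table, the action strings (`"skip"`, `"push VALUE_MODE"`, `"pop"`) and the `tokenize` helper are all made up for illustration. It only mimics the behaviour described above: the longest match wins, the earlier rule breaks ties, and `=` switches to a value mode that ends at the next line break.

```python
import re

# Each mode lists (token name, compiled pattern, actions).
# The names and action strings are illustrative, not ANTLR API.
RULES = {
    "DEFAULT": [
        ("SECTION", re.compile(r"\[[^\]]+\]"), ()),
        ("COMMENT", re.compile(r";[^\r\n]*"), ()),
        ("ASSIGN", re.compile(r"="), ("push VALUE_MODE",)),
        ("KEY", re.compile(r"[^ \t\r\n]+"), ()),
        ("SPACES", re.compile(r"[ \t\r\n]+"), ("skip",)),
    ],
    "VALUE_MODE": [
        ("SPACES", re.compile(r"[ \t]+"), ("skip",)),
        ("VALUE", re.compile(r"[^ \t\r\n]+"), ()),
        ("COMMENT", re.compile(r";[^\r\n]*"), ()),
        ("NEWLINE", re.compile(r"[\r\n]+"), ("skip", "pop")),
    ],
}


def tokenize(text):
    modes = ["DEFAULT"]  # mode stack; the top entry selects the active rules
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern, actions in RULES[modes[-1]]:
            match = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if match and (best is None or match.end() > best[1].end()):
                best = (name, match, actions)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, match, actions = best
        if "skip" not in actions:
            tokens.append((name, match.group()))
        for action in actions:
            if action.startswith("push "):
                modes.append(action.split()[1])
            elif action == "pop":
                modes.pop()
        pos = match.end()
    return tokens


for name, text in tokenize("a = 1\nc = 3 ; note\n"):
    print(f"{name:<8} {text!r}")
```

The same character class (`[^ \t\r\n]+`) yields KEY in the default mode and VALUE in the value mode, which is exactly the distinction a single-mode lexer cannot make.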

    Here's a quick demo of how this could work (stick it in a file called IniLexer.g4):

    lexer grammar IniLexer;
    
    SECTION
     : '[' ~[\]]+ ']'
     ;
    
    COMMENT
     : ';' ~[\r\n]*
     ;
    
    ASSIGN
     : '=' -> pushMode(VALUE_MODE)
     ;
    
    KEY
     : ~[ \t\r\n]+
     ;
    
    SPACES
     : [ \t\r\n]+ -> skip
     ;
    
    UNRECOGNIZED
     : .
     ;
    
    mode VALUE_MODE;
    
      VALUE_MODE_SPACES
       : [ \t]+ -> skip
       ;
    
      VALUE
       : ~[ \t\r\n]+
       ;
    
      VALUE_MODE_COMMENT
       : ';' ~[\r\n]* -> type(COMMENT)
       ;
    
      VALUE_MODE_NL
       : [\r\n]+ -> skip, popMode
       ;
    

    If you now run the following script:

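    from antlr4 import InputStream, CommonTokenStream
    # Assumes IniLexer.py was generated next to this script,
    # e.g. with: antlr IniLexer.g4 -Dlanguage=Python3
    from IniLexer import IniLexer
    
    source = """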
    ; Comment outside
    
    [section s1]
    ; Comment inside
    a = 1
    b = 2
    
    [section s2]
    c = 3 ; Comment right side
    d = e
    """
    
    lexer = IniLexer(InputStream(source))
    stream = CommonTokenStream(lexer)
    stream.fill()
    
    for token in stream.tokens[:-1]:
        print("{0:<25} '{1}'".format(IniLexer.symbolicNames[token.type], token.text))
    

    you will see the following output:

    COMMENT                   '; Comment outside'
    SECTION                   '[section s1]'
    COMMENT                   '; Comment inside'
    KEY                       'a'
    ASSIGN                    '='
    VALUE                     '1'
    KEY                       'b'
    ASSIGN                    '='
    VALUE                     '2'
    SECTION                   '[section s2]'
    KEY                       'c'
    ASSIGN                    '='
    VALUE                     '3'
    COMMENT                   '; Comment right side'
    KEY                       'd'
    ASSIGN                    '='
    VALUE                     'e'
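
One thing to watch when feeding these tokens to QScintilla: the lexer above skips whitespace, so the tokens no longer cover the whole text, and `setStyling()` counts bytes rather than characters. A possible way to handle both is sketched below. The `styling_runs` helper and the style numbers in the example are hypothetical; the only assumption about ANTLR is that each token exposes inclusive character offsets (`token.start`, `token.stop` in the Python runtime).

```python
def styling_runs(text, spans, default_style=0):
    """Turn token spans into (byte_length, style) runs covering all of `text`.

    `spans` holds (start, stop, style) tuples with inclusive character
    offsets, ordered by position.
    """
    runs, pos = [], 0
    for start, stop, style in spans:
        if start > pos:  # gap left by skipped whitespace
            runs.append((len(text[pos:start].encode("utf-8")), default_style))
        runs.append((len(text[start:stop + 1].encode("utf-8")), style))
        pos = stop + 1
    if pos < len(text):  # trailing skipped text, e.g. the final newline
        runs.append((len(text[pos:].encode("utf-8")), default_style))
    return runs


# Placeholder style numbers: 1 = key, 2 = assign, 3 = value, 4 = comment.
text = "a = é ; x\n"
spans = [(0, 0, 1), (2, 2, 2), (4, 4, 3), (6, 8, 4)]
runs = styling_runs(text, spans)
print(runs)
```

Inside `styleText()` you could then build `spans` from `(t.start, t.stop, t.type)` for every non-EOF token, call `self.startStyling(0)` once, and call `self.setStyling(length, style)` for each run. Registering one colour for `IniLexer.KEY` and another for `IniLexer.VALUE` with `setColor()` should then give the two-colour assignments from the screenshot.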
    

    And an accompanying parser grammar could look like this:

    parser grammar IniParser;
    
    options {
      tokenVocab=IniLexer;
    }
    
    sections
     : section* EOF
     ;
    
    section
     : COMMENT
     | SECTION section_atom*
     ;
    
    section_atom
     : COMMENT
     | KEY ASSIGN VALUE
     ;
    

    which would parse your example input into the following parse tree:

    [image: parse tree]
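
To give a rough idea of the shape of that tree in text form, here is a small hand-written Python sketch that groups a token list the way the three parser rules describe. It is not the generated IniParser; `parse_sections` and the tuple layout are invented for illustration, and it assumes comments following a section header are attached to that section (the greedy reading of `section_atom*`).

```python
def parse_sections(tokens):
    """Group (type_name, text) pairs into a nested tuple mirroring the grammar."""
    sections, i = [], 0
    while i < len(tokens):
        kind = tokens[i][0]
        if kind == "COMMENT":  # section : COMMENT
            sections.append(("section", [tokens[i]]))
            i += 1
        elif kind == "SECTION":  # section : SECTION section_atom*
            children = [tokens[i]]
            i += 1
            while i < len(tokens) and tokens[i][0] in ("COMMENT", "KEY"):
                if tokens[i][0] == "COMMENT":  # section_atom : COMMENT
                    children.append(("section_atom", [tokens[i]]))
                    i += 1
                else:  # section_atom : KEY ASSIGN VALUE
                    triple = list(tokens[i:i + 3])
                    if [k for k, _ in triple] != ["KEY", "ASSIGN", "VALUE"]:
                        raise SyntaxError(f"expected KEY ASSIGN VALUE at token {i}")
                    children.append(("section_atom", triple))
                    i += 3
            sections.append(("section", children))
        else:
            raise SyntaxError(f"unexpected {kind} at token {i}")
    return ("sections", sections)


tokens = [
    ("COMMENT", "; Comment outside"),
    ("SECTION", "[section s1]"),
    ("COMMENT", "; Comment inside"),
    ("KEY", "a"), ("ASSIGN", "="), ("VALUE", "1"),
    ("KEY", "b"), ("ASSIGN", "="), ("VALUE", "2"),
    ("SECTION", "[section s2]"),
    ("KEY", "c"), ("ASSIGN", "="), ("VALUE", "3"),
    ("COMMENT", "; Comment right side"),
    ("KEY", "d"), ("ASSIGN", "="), ("VALUE", "e"),
]
tree = parse_sections(tokens)
for _, children in tree[1]:
    print(children[0][1], "->", len(children) - 1, "child node(s)")
```

For the example input this yields three `section` nodes: the leading comment on its own, then one node per `[section ...]` header with its comments and assignments as `section_atom` children.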