
Matching nested constructs in TextMate / Sublime Text / Atom language grammars


While writing a grammar for GitHub to syntax-highlight programs written in the Racket language, I stumbled upon a problem.

In Racket #| starts a multiline comment and |# ends it.

The problem is that multiline comments can be nested:

  #| a comment  #| still a comment |# even 
                                      more comment |#

Here is my non-working attempt:

repository:
  multilinecomment: 
    begin:         \#\|
    end:           \|\#
    name:          comment
    contentName:   comment
    patterns:
    - include:     "#multilinecomment"
      name:        comment
    - match:       ([^\|]|\|(?=[^#]))*
      name:        comment

The intent of the two patterns is:

  1. "#multilinecomment" A multiline comment can contain another multiline comment.
  2. ([^\|]|\|(?=[^#]))* The meaning of the subexpressions:

     [^\|]        any character other than `|`
     \|(?=[^#])   an `|` followed by a non-`#`
    

The entire expression thus matches a string that does not contain |#
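
To see what that catch-all expression actually consumes, here is a minimal sketch that runs it over the text following the outer #| of the example comment. It uses Python's re module rather than Oniguruma, on the assumption that the two engines treat this simple character class and lookahead the same way; the names NOT_CLOSER and body are made up for the illustration.

```python
import re

# The catch-all pattern from the grammar above, copied as-is into a raw string.
NOT_CLOSER = re.compile(r"([^\|]|\|(?=[^#]))*")

# Text that follows the outer "#|" in the example comment (first line only).
body = " a comment  #| still a comment |# even"

# match() anchors at the start of the string, like a rule tried at that position.
matched = NOT_CLOSER.match(body).group(0)
print(repr(matched))  # ' a comment  #| still a comment '
```

The match runs straight through the inner #| and only stops at the first |#. In other words, the expression treats a nested opener as ordinary comment text, which may be one reason the include rule never gets a chance to fire. How TextMate, Sublime Text or GitHub arbitrate between the two rules is not verified here.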

Update:

Got an answer from Allan Odgaard on the TextMate mailing list:

http://textmate.1073791.n5.nabble.com/TextMate-grammars-and-nested-multiline-comments-td28743.html


Solution

  • So I've tested a bunch of languages in Sublime that have multiline comments (C/C++, Java, HTML, PHP, JavaScript), and none of their syntax definitions support multiline comments nested inside multiline comments: the highlighting for the comment scope ends at the first "comment close" marker, not at the symmetric one. Now, this isn't to say that it's impossible, because the BracketHighlighter plugin works great for matching symmetric tags, brackets, and other markers. However, it's written in Python and uses custom logic for its matching algorithms, something that may not be available in the Oniguruma engine that powers Sublime's syntax highlighter, and apparently GitHub's as well.

    Basically, from your description of the problem, you need a code parser to ensure that nested comments are legal, something you can't do with just a syntax highlighting definition. If you're writing this just for Sublime, a custom plugin could take care of that, but I don't know enough about GitHub's Linguist syntax highlighting system to say whether you're allowed to do that. I'm not a regex master yet, but it seems to me that it would be rather difficult to achieve this purely with a regex, as you'd need to somehow keep track of an arbitrary number of internal symmetric "open" and "close" markers before finding (and identifying!) the final one.

    Sorry I couldn't provide a more definitive answer than "I'm not sure this is possible", but that's the best I can come up with without knowing more about Sublime's and GitHub's internals, something that (at least in Sublime's case) won't happen unless it's open-sourced. Good luck!
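
    The bookkeeping described above (counting "open" and "close" markers until the outermost one is closed) can be sketched outside a grammar as a plain depth counter. This is only an illustration in Python of the kind of custom logic a plugin could use, not how Sublime, TextMate or GitHub's Linguist work internally; find_comment_end is a hypothetical helper name, and edge cases such as overlapping markers ("#|#") are handled naively.

```python
def find_comment_end(text, start=0):
    """Return the index just past the `|#` that closes the comment
    opening at `start`, or -1 if the comment is never closed."""
    if not text.startswith("#|", start):
        raise ValueError("no '#|' at the given start position")
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "#|":        # nested (or outermost) opener
            depth += 1
            i += 2
        elif pair == "|#":      # closer: leave one nesting level
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return -1


sample = "#| a comment  #| still a comment |# even\n more comment |# (define x 1)"
end = find_comment_end(sample)
print(repr(sample[end:]))  # ' (define x 1)'
```

    The counter goes up at each #| and down at each |#, so the comment ends only when the depth returns to zero, which is the "symmetric" behaviour the question asks for.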