
Parsing CSS with ANTLR - edge cases


I'm trying to parse CSS, or at least the basics, using ANTLR, but I'm running into a few problems with my lexer rules. The problem lies in the ambiguity between ID selectors and hexadecimal color values. Using a simplified grammar for clarity, consider the following input:

#bbb {
  color: #fff;
}

and the following parser rules:

ruleset : selector '{' property* '}';
selector: '#' ALPHANUM;
property: ALPHANUM ':' value ';' ;
value: COLOR;

and these lexer tokens:

ALPHANUM : ('a'..'z' | '0'..'9')+;
COLOR : '#' ('0'..'9' | 'a'..'f')+;

This will not work, because #bbb is tokenized as a COLOR token, even though it's supposed to be a selector. If I change the selector to not start with a hexadecimal character, it works fine. I'm not sure how to solve this. Is there a way to tell ANTLR to treat a specific token only as a COLOR token if it's in a certain position? Say, if it's in a property rule, I can safely assume it's a color token. If it isn't, treat it as a selector.

Any help would be appreciated!


Solution: Turns out I was trying to do too much in the grammar; this is better handled in code that works on the AST. CSS has too many ambiguous constructs to reliably assign each one its own token type in the lexer, so the approach I'm using now is to tokenize only the special characters like '#', '.', ':' and the curly braces, and do the post-processing in the consumer code. This works a lot better, and it's easier to deal with the edge cases.
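To illustrate the post-processing idea, here is a minimal sketch in Python rather than ANTLR. It is not the code I actually used; the names `TOKEN_RE`, `tokenize` and `classify_hashes` are made up for this example, and it assumes a '#' inside a declaration block is always a color while a '#' outside one is always an ID selector.

```python
import re

# Hypothetical token pattern: one special character, or a run of word characters.
TOKEN_RE = re.compile(r"[#.:;{}]|[A-Za-z0-9_-]+")
SPECIAL = "#.:;{}"


def tokenize(css):
    """Split the input into special characters and plain words; whitespace is skipped."""
    return TOKEN_RE.findall(css)


def classify_hashes(tokens):
    """Label each '#' followed by a word as an ID selector or a color, by position."""
    results = []
    depth = 0  # greater than 0 while inside a '{' ... '}' block
    for i, tok in enumerate(tokens):
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
        elif tok == "#" and i + 1 < len(tokens) and tokens[i + 1][0] not in SPECIAL:
            # Assumption: inside a block it is a color, outside it is a selector.
            kind = "color" if depth > 0 else "id_selector"
            results.append((kind, tokens[i + 1]))
    return results


tokens = tokenize("#bbb {\n  color: #fff;\n}")
print(classify_hashes(tokens))
# prints [('id_selector', 'bbb'), ('color', 'fff')]
```

The lexer stage never has to decide what `bbb` means; the decision is made afterwards, where the surrounding context is available.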


Solution

  • Try moving the # in your lexer file out of COLOR and into its own token, like this:

    LLETTERS: ( 'a'..'z' );
    ULETTERS: ( 'A'..'Z' );
    NUMBERS: ( '0'..'9' );
    HASH : '#';
    

    Then, in your parser rules, you can do it like this:

    color: HASH (LLETTERS | NUMBERS)+;
    selector: HASH (ULETTERS | LLETTERS) (ULETTERS | LLETTERS | NUMBERS)*;
    

    etc.

    This lets you specify the difference grammatically (roughly, by context) rather than lexically (roughly, by appearance). If the meaning of something changes depending on where it is, that difference should be specified in the parser rules, not in the lexer.

    Note that color and selector have almost the same definition. The lexer typically runs as a separate stage before the parser matches the token stream against the grammar, so it cannot use context to resolve an ambiguous lexicon (as was pointed out, bbb could be a hex value or a string of lowercase letters). Data validity checking, such as confirming that a color contains only hex digits, therefore needs to be done elsewhere.
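As a rough sketch of what checking validity "elsewhere" could look like, the consumer code might validate the text that follows HASH once the parser has decided which rule matched. This is Python, not ANTLR, and the helper names `is_hex_color` and `is_id_name` are hypothetical. It assumes only the 3- and 6-digit color forms are accepted and that an ID starts with a letter, as in the selector rule above.

```python
import re

HEX_DIGITS_RE = re.compile(r"[0-9a-fA-F]+")
ID_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_hex_color(text):
    """Check the text after '#': hex digits only, assumed length 3 or 6."""
    return len(text) in (3, 6) and HEX_DIGITS_RE.fullmatch(text) is not None


def is_id_name(text):
    """Check the text after '#': a letter followed by letters or digits."""
    return ID_NAME_RE.fullmatch(text) is not None


print(is_hex_color("fff"), is_hex_color("ggg"))
# prints True False
print(is_id_name("bbb"), is_id_name("9bb"))
# prints True False
```

Note that `bbb` passes both checks, which is exactly why the lexer cannot tell the two apart by appearance.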