
FsLex FsYacc: How to create a language with a multi-line comment


I am playing around with FsLex and FsYacc, which are based on ocamllex and ocamlyacc. What is the best way to define a comment in a language? Do I create a comment token in my lex file? There are a few complications to comments that I cannot wrap my head around in the context of a grammar:

  1. A comment can be placed literally anywhere in the grammar and should be ignored.
  2. A comment can have literally anything in it including other tokens and invalid code.
  3. Comments can span many lines, and I need to maintain the source code position for the debugger. In FsLex and ocamllex, this has to be done by the language developer.

Solution

  • Since you include the ocaml tag I'll answer for ocamllex.

    It's true that handling comments is difficult, especially if your language wants to be able to comment out sections of code. In this case, the comment lexer has to look for (a reduced set of) tokens inside comments, so as not to be fooled by comment closures appearing in quoted context. It also means that the lexer should follow the nesting of comments, so commented-out comments don't confuse things.
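
    To make the two requirements concrete (track nesting, and ignore comment closers inside string literals), here is a minimal hand-written sketch in plain OCaml. It is not how ocamllex does it; `skip_comment` and `Unterminated_comment` are hypothetical names invented for this illustration, and character literals are ignored for brevity.

```ocaml
(* Hypothetical exception for this sketch only. *)
exception Unterminated_comment

(* Return the index just past the comment that opens at [start] in [src].
   Nested comments are followed by counting depth, and string literals
   inside the comment are skipped so a closer in quotes is not counted. *)
let skip_comment (src : string) (start : int) : int =
  let n = String.length src in
  (* [i] is just past an opening double quote; return the index past the
     closing quote, stepping over backslash escapes. *)
  let rec skip_string i =
    if i >= n then raise Unterminated_comment
    else
      match src.[i] with
      | '"' -> i + 1
      | '\\' -> skip_string (i + 2)
      | _ -> skip_string (i + 1)
  in
  let rec go depth i =
    if depth = 0 then i
    else if i >= n then raise Unterminated_comment
    else if i + 1 < n && src.[i] = '(' && src.[i + 1] = '*' then
      go (depth + 1) (i + 2)
    else if i + 1 < n && src.[i] = '*' && src.[i + 1] = ')' then
      go (depth - 1) (i + 2)
    else if src.[i] = '"' then go depth (skip_string (i + 1))
    else go depth (i + 1)
  in
  if start + 1 < n && src.[start] = '(' && src.[start + 1] = '*' then
    go 1 (start + 2)
  else invalid_arg "skip_comment: no comment opener at start"
```

    For example, `skip_comment "(* a (* b *) c *) x" 0` returns 17, the index just after the outer closer, because the inner comment only lowers the depth from 2 to 1.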

    The OCaml compiler itself is an example of this approach. Comment handling for the OCaml compiler has three parts. The first-level lexing rule looks like this:

    rule main = parse
    
        . . . code omitted here . . .
    
        | "(*"
          { comment_depth := 1;
            handle_lexical_error comment lexbuf;
            main lexbuf }
    

    The second level consists of the function handle_lexical_error and the function comment. The former evaluates a lexing function while catching a specific exception. The latter is the detailed lexing function for comments. After the lexing of the comment, the code above goes back to regular lexing (with main lexbuf).
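
    As a rough idea of what such a wrapper can look like, here is a sketch. It is an assumed shape, not a verbatim copy of the real function: the exception is taken to carry a message, file name, line and column, and the wrapper fills in the location of the comment opener when the inner rule raises with the placeholder location `("", 0, 0)` used by the rules in this answer. The 1-based column is an arbitrary choice.

```ocaml
(* Assumed shape of the exception: message, file, line, column. *)
exception Lexical_error of string * string * int * int

(* Run [fn] on [lexbuf]. If it raises with a placeholder location,
   re-raise with the position where the current token began, which is
   the comment opener when called from the main rule. *)
let handle_lexical_error fn lexbuf =
  let p = Lexing.lexeme_start_p lexbuf in
  let line = p.Lexing.pos_lnum
  and column = p.Lexing.pos_cnum - p.Lexing.pos_bol + 1
  and file = p.Lexing.pos_fname in
  try fn lexbuf
  with Lexical_error (msg, "", 0, 0) ->
    raise (Lexical_error (msg, file, line, column))
```

    The point of the design is that the position is captured before the comment rule runs, so an unterminated comment is reported where it starts rather than at end of file.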

    The function comment looks like this:

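    and comment = parse
        "(*"
        { (* nested opener: go one level deeper *)
          incr comment_depth; comment lexbuf }
      | "*)"
        { (* closer: leave one level, stop once the outermost comment ends *)
          decr comment_depth;
          if !comment_depth = 0 then () else comment lexbuf }
      | '"'
        { (* string literal inside a comment: lex it and discard the result *)
          reset_string_buffer();
          string lexbuf;
          reset_string_buffer();
          comment lexbuf }
      | "'"
        { skip_char lexbuf;
          comment lexbuf }
      | eof
        { (* placeholder location, filled in by handle_lexical_error *)
          raise(Lexical_error("unterminated comment", "", 0, 0)) }
      | '\010'
        { (* newline: keep the tracked line number up to date *)
          incr_loc lexbuf 0;
          comment lexbuf }
      | _
        { comment lexbuf }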
    

    So, yes, it's pretty complicated to do a good job.

    For the last point, ocamllex tracks character offsets for you automatically, and you can retrieve them from the lexbuf (see the OCaml Lexing module). Line numbers are not maintained by the generated lexer, though, which is why the comment lexing function above adjusts the position itself when it lexes a newline: the incr_loc function increments the tracked line number.
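
    A small sketch of what the standard library offers here, using only the `Lexing` module: `Lexing.new_line` bumps the line number and resets the beginning-of-line offset in the lexbuf's current position. In an ocamllex rule you would call it in the action of the newline case. `show_pos` is a hypothetical helper for this illustration.

```ocaml
(* Hypothetical helper: render the current position as line:column,
   with a 1-based line and a 0-based column. *)
let show_pos (lexbuf : Lexing.lexbuf) : string =
  let p = lexbuf.Lexing.lex_curr_p in
  Printf.sprintf "%d:%d" p.Lexing.pos_lnum
    (p.Lexing.pos_cnum - p.Lexing.pos_bol)

let lexbuf = Lexing.from_string "(* first\n   second *)"

(* Position of a fresh lexbuf, before anything has been lexed. *)
let before = show_pos lexbuf

(* What a newline action would do: record that a new line has begun. *)
let () = Lexing.new_line lexbuf

let after = show_pos lexbuf
```

    Here `before` is "1:0" and `after` is "2:0". Nothing has been consumed, so the character offset is still 0; only the line bookkeeping has changed.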

    I'm not sure how closely F# tracks this design, but hopefully this will be helpful.

    Update

    Here is the string lexing function:

    and string = parse
        '"'
        { () }
      | '\\' ("\010" | "\013" | "\013\010") ([' ' '\009'] * as spaces)
        { (* backslash-newline: skip the line break and leading blanks *)
          incr_loc lexbuf (String.length spaces);
          string lexbuf }
      | '\\' (backslash_escapes as c)
        { store_string_char(char_for_backslash c);
          string lexbuf }
      | '\\' (['0'-'9'] as c) (['0'-'9'] as d) (['0'-'9'] as u)
        { let v = decimal_code c d u in
          if in_pattern () && v > 255 then
            warning lexbuf
              (Printf.sprintf
                "illegal backslash escape in string: `\\%c%c%c'" c d u);
          store_string_char (Char.chr v);
          string lexbuf }
      | '\\' 'x' (['0'-'9' 'a'-'f' 'A'-'F'] as d) (['0'-'9' 'a'-'f' 'A'-'F'] as u)
        { store_string_char (char_for_hexadecimal_code d u);
          string lexbuf }
      | '\\' (_ as c)
        { if in_pattern () then
            warning lexbuf
              (Printf.sprintf "illegal backslash escape in string: `\\%c'" c);
          store_string_char '\\';
          store_string_char c;
          string lexbuf }
      | eof
        { raise(Lexical_error("unterminated string", "", 0, 0)) }
      | '\010'
        { (* newline inside a string: store it and count the line *)
          store_string_char '\010';
          incr_loc lexbuf 0;
          string lexbuf }
      | _ as c
        { store_string_char c;
          string lexbuf }
    

    If you want to know more, you can find the full OCaml lexer source here: lexer.mll.