I am playing around with FsLex and FsYacc, which are based on ocamllex and ocamlyacc. What is the best way to define a comment in a language? Do I create a comment token in my lex file? There are a few complications to comments that I cannot wrap my head around in the context of a grammar:
Since you include the ocaml tag, I'll answer for ocamllex.
It's true that handling comments is difficult, especially if your language wants to be able to comment out sections of code. In this case, the comment lexer has to look for (a reduced set of) tokens inside comments, so as not to be fooled by comment closures appearing in quoted context. It also means that the lexer should follow the nesting of comments, so commented-out comments don't confuse things.
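To make the two requirements concrete (follow nesting, and don't be fooled by a closer inside a string), here is a minimal hand-written sketch in plain OCaml rather than ocamllex. The names `skip_comment` and `Unterminated` are hypothetical and exist only for this illustration; the sketch also ignores character literals, which a real lexer would need to handle.

```ocaml
(* Hypothetical error for this sketch only. *)
exception Unterminated of string

(* Given the source text and the index just past a comment opener, return
   the index just past the matching comment closer. *)
let skip_comment (src : string) (start : int) : int =
  let n = String.length src in
  (* Skip a string literal; [i] is the index just past the opening quote. *)
  let rec skip_string i =
    if i >= n then raise (Unterminated "string")
    else match src.[i] with
      | '"' -> i + 1
      | '\\' -> skip_string (i + 2)   (* skip the escaped character too *)
      | _ -> skip_string (i + 1)
  in
  (* [depth] counts how many comments are currently open. *)
  let rec go depth i =
    if i >= n then raise (Unterminated "comment")
    else if i + 1 < n && src.[i] = '(' && src.[i + 1] = '*' then
      go (depth + 1) (i + 2)
    else if i + 1 < n && src.[i] = '*' && src.[i + 1] = ')' then
      if depth = 1 then i + 2 else go (depth - 1) (i + 2)
    else if src.[i] = '"' then
      go depth (skip_string (i + 1))
    else
      go depth (i + 1)
  in
  go 1 start
```

Called on a comment that contains both a nested comment and a quoted closer, the function returns the index after the outermost closer, so whatever follows it is lexed as ordinary code again.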
The OCaml compiler itself is an example of this approach. Comment handling for the OCaml compiler has three parts. The first-level lexing rule looks like this:
rule main = parse
    . . . code omitted here . . .
  | "(*"
    { comment_depth := 1;
      handle_lexical_error comment lexbuf;
      main lexbuf }
The second level consists of the function handle_lexical_error and the function comment. The former evaluates a lexing function while catching a specific exception. The latter is the detailed lexing function for comments. After lexing the comment, the code above goes back to regular lexing (with main lexbuf).
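A wrapper of this kind could look roughly like the sketch below. This is an assumption on my part, not the compiler's actual code: the shape of `Lexical_error` is inferred from the four-argument raise sites in the rules, and the choice to re-raise with the starting position filled in is hypothetical.

```ocaml
(* Assumed exception shape: message, file name, line number, column. *)
exception Lexical_error of string * string * int * int

(* Hypothetical sketch: run a lexing function and, if it reports an error
   with placeholder location fields, re-raise the error with the position
   where the enclosing token started. *)
let handle_lexical_error (fn : Lexing.lexbuf -> 'a) (lexbuf : Lexing.lexbuf) : 'a =
  (* Remember the start position before the sub-lexer moves on. *)
  let start = lexbuf.Lexing.lex_start_p in
  try fn lexbuf
  with Lexical_error (msg, "", 0, 0) ->
    let line = start.Lexing.pos_lnum in
    let col = start.Lexing.pos_cnum - start.Lexing.pos_bol in
    raise (Lexical_error (msg, start.Lexing.pos_fname, line, col))
```

The point of the design is that the comment and string rules can raise an error without knowing where the enclosing construct began; the wrapper, which does know, supplies the location.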
The function comment looks like this:
rule comment = parse
    "(*"
    { incr comment_depth; comment lexbuf }
  | "*)"
    { decr comment_depth;
      if !comment_depth = 0 then () else comment lexbuf }
  | '"'
    { reset_string_buffer();
      string lexbuf;
      reset_string_buffer();
      comment lexbuf }
  | "'"
    { skip_char lexbuf ;
      comment lexbuf }
  | eof
    { raise(Lexical_error("unterminated comment", "", 0, 0)) }
  | '\010'
    { incr_loc lexbuf 0;
      comment lexbuf }
  | _
    { comment lexbuf }
So, yes, it's pretty complicated to do a good job.
For the last point, ocamllex tracks source code positions for you automatically, at least as character offsets. You can retrieve them from the lexbuf. See the OCaml Lexing module. (However, note that line numbers are not updated for you: the comment lexing function above adjusts the position itself when it lexes a newline. The incr_loc function increments the tracked line number.)
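As a rough illustration of what such a helper might do, here is a sketch built only on the standard Lexing module. The body of `incr_loc` is my reconstruction, not copied from the compiler, and `line_and_column` is a hypothetical helper added to show how the position is read back.

```ocaml
(* Hypothetical reconstruction: record that a newline was just consumed.
   [chars] is how many characters after the newline the lexer has already
   read, so the start of the new line is placed right after the newline. *)
let incr_loc (lexbuf : Lexing.lexbuf) (chars : int) : unit =
  let pos = lexbuf.Lexing.lex_curr_p in
  lexbuf.Lexing.lex_curr_p <-
    { pos with
      Lexing.pos_lnum = pos.Lexing.pos_lnum + 1;
      Lexing.pos_bol = pos.Lexing.pos_cnum - chars }

(* Hypothetical helper: current line number and column, both derived from
   the position stored in the lexbuf. *)
let line_and_column (lexbuf : Lexing.lexbuf) : int * int =
  let pos = lexbuf.Lexing.lex_curr_p in
  (pos.Lexing.pos_lnum, pos.Lexing.pos_cnum - pos.Lexing.pos_bol)
```

The standard library also provides Lexing.new_line, which performs the same update for the common case where nothing has been read past the newline.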
I'm not sure how closely F# tracks this design, but hopefully this will be helpful.
Update
Here is the string lexing function:
rule string = parse
    '"'
    { () }
  | '\\' ("\010" | "\013" | "\013\010") ([' ' '\009'] * as spaces)
    { incr_loc lexbuf (String.length spaces);
      string lexbuf }
  | '\\' (backslash_escapes as c)
    { store_string_char(char_for_backslash c);
      string lexbuf }
  | '\\' (['0'-'9'] as c) (['0'-'9'] as d) (['0'-'9'] as u)
    { let v = decimal_code c d u in
      if in_pattern () && v > 255 then
        warning lexbuf
          (Printf.sprintf
            "illegal backslash escape in string: `\\%c%c%c'" c d u) ;
      store_string_char (Char.chr v);
      string lexbuf }
  | '\\' 'x' (['0'-'9' 'a'-'f' 'A'-'F'] as d) (['0'-'9' 'a'-'f' 'A'-'F'] as u)
    { store_string_char (char_for_hexadecimal_code d u) ;
      string lexbuf }
  | '\\' (_ as c)
    { if in_pattern () then
        warning lexbuf
          (Printf.sprintf "illegal backslash escape in string: `\\%c'" c) ;
      store_string_char '\\' ;
      store_string_char c ;
      string lexbuf }
  | eof
    { raise(Lexical_error("unterminated string", "", 0, 0)) }
  | '\010'
    { store_string_char '\010';
      incr_loc lexbuf 0;
      string lexbuf }
  | _ as c
    { store_string_char c;
      string lexbuf }
If you want to know more, you can find the full OCaml lexer source here: lexer.mll.