Search code examples
parsingocamlfrontendocamllexocamlyacc

Record tokens and their position to use them outside the front-end


I want to write a small beautifier for a specific language. In the beautifier, we will be able to indent one or several lines (ie, adding white-spaces on the left hand of each line); we will also be able to format the whole code (ie, alter white-spaces and newlines in appropriate locations).

Given a program, my front-end by ocamllex and ocamlyacc could build a Abstract Syntax Tree (AST):

(* in main.ml *)
let f = open_in file in
let buf = Lexing.from_channel f in
let ast = Parser.main Lexer.token buf in
analyse ast
...

I am familier with working on the AST to analyse, compile and print (not exactly the same) the program. However, it seems that we need to work directly on tokens to write a good beautifier. But I don't know how to manipulate tokens outside the front-end.

For example, is it common to record tokens and their position somewhere while parsing, so that we could still use them outside the front-end? For example, we may go through tokens in this record one by one, and print exactly the same program (including exact white spaces)?

Does anyone have any code snippet?

Edit 1: Here are some examples that use Lexing.lexeme_start_p on lexbuf runtime. However, what I want to know is whether and how people get these information outside (or after) a parsing? For instance, outside (or after) a parsing, how could we get the token from a position?

  (* in main.ml *)
  let ast = try Parser.main Lexer.token buf with
    | Lexer.Lexing_error e ->
      let pos = Lexing.lexeme_start_p buf in
      let l = pos.pos_lnum in
      let c = pos.pos_cnum - pos.pos_bol + 1 in
      pffo "File \"%s\", line %d, characters %d-%d:\n" file l (c-1) c
      pffo "Unexpected exception, parser top : lexical analysis > %s@." e;
      exit 1
    ...

 (* in lexer.mll *)
 rule token = parse 
   ...
   | "'" '\\' (_ as c)
     { let msg = Printf.sprintf "illegal escape sequence \\%c" c in
       let p = Lexing.lexeme_start_p lexbuf in
       raise (Lexical_error (msg, p.Lexing.pos_fname, p.Lexing.pos_lnum, 
              p.Lexing.pos_cnum - p.Lexing.pos_bol + 1)) }

Solution

  • Keeping token positions with tokens is pretty common in practical programming language implementations.

    The easiest way to print out a part of input code as-is is keep the input text somewhere and extract the part you want using the token locations. Rebuilding the text from the stream of tokens and its positions inserting white spaces appropriately is hard to implement and very error prone I am afraid, and impossible when your lexer ignores non white space thing like comments.

    Such an example of printing input code as-is can be found in OCaml compiler implementation. For example Location.highlight_dumb tries to print the code around an error using lexer's lex_buffer field which carries the input text, though sometimes it is impossible since lex_buffer does not keep whole the input.