I want to write a small beautifier for a specific language. In the beautifier, we will be able to indent one or several lines (ie, adding white-spaces on the left hand of each line); we will also be able to format the whole code (ie, alter white-spaces and newlines in appropriate locations).
Given a program, my front-end by ocamllex
and ocamlyacc
could build a Abstract Syntax Tree (AST)
:
(* in main.ml *)
let f = open_in file in
let buf = Lexing.from_channel f in
let ast = Parser.main Lexer.token buf in
analyse ast
...
I am familier with working on the AST to analyse, compile and print (not exactly the same) the program. However, it seems that we need to work directly on tokens to write a good beautifier. But I don't know how to manipulate tokens outside the front-end.
For example, is it common to record tokens and their position somewhere while parsing, so that we could still use them outside the front-end? For example, we may go through tokens in this record one by one, and print exactly the same program (including exact white spaces)?
Does anyone have any code snippet?
Edit 1:
Here are some examples that use Lexing.lexeme_start_p
on lexbuf
runtime. However, what I want to know is whether and how people get these information outside (or after) a parsing? For instance, outside (or after) a parsing, how could we get the token from a position?
(* in main.ml *)
let ast = try Parser.main Lexer.token buf with
| Lexer.Lexing_error e ->
let pos = Lexing.lexeme_start_p buf in
let l = pos.pos_lnum in
let c = pos.pos_cnum - pos.pos_bol + 1 in
pffo "File \"%s\", line %d, characters %d-%d:\n" file l (c-1) c
pffo "Unexpected exception, parser top : lexical analysis > %s@." e;
exit 1
...
(* in lexer.mll *)
rule token = parse
...
| "'" '\\' (_ as c)
{ let msg = Printf.sprintf "illegal escape sequence \\%c" c in
let p = Lexing.lexeme_start_p lexbuf in
raise (Lexical_error (msg, p.Lexing.pos_fname, p.Lexing.pos_lnum,
p.Lexing.pos_cnum - p.Lexing.pos_bol + 1)) }
Keeping token positions with tokens is pretty common in practical programming language implementations.
The easiest way to print out a part of input code as-is is keep the input text somewhere and extract the part you want using the token locations. Rebuilding the text from the stream of tokens and its positions inserting white spaces appropriately is hard to implement and very error prone I am afraid, and impossible when your lexer ignores non white space thing like comments.
Such an example of printing input code as-is can be found in OCaml compiler implementation. For example Location.highlight_dumb
tries to print the code around an error using lexer's lex_buffer
field which carries the input text, though sometimes it is impossible since lex_buffer
does not keep whole the input.