ocaml frontend lex pretty-print ocamllex

Faithfully handle white-spacing in a pretty-printer

I am writing a front-end for a language (using ocamllex and ocamlyacc).

So the front-end can build an Abstract Syntax Tree (AST) from a program. We then often write a pretty-printer, which takes an AST and prints a program. If we only want to compile or analyse the AST later, we usually don't need the printed program to match the original exactly in terms of white-spacing. This time, however, I want to write a pretty-printer that reproduces the original program's white-spacing exactly.

Therefore, my question is: what are the best practices for handling white-spacing without modifying the AST types too much? I really don't want to add a number (of white-spaces) to each type in the AST.

For example, this is how I currently deal with (i.e., skip) white-spacing in lexer.mll:

rule token = parse
  ...
  | [' ' '\t']       { token lexbuf }     (* skip blanks *)
  | eof              { EOF }

Does anyone know how to change this, as well as other parts of the front-end, to correctly take white-spacing into account for later printing?

Solution

It's quite common to keep source-file location information for each token. This information allows for more accurate errors, for example.

The most general way to do this is to keep the beginning and ending line number and column position for each token, which is a total of four numbers. If it were easy to compute the end position of a token from its value and the start position, that could be reduced to two numbers, but at the price of extra code complexity.
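As a rough illustration of the two-number variant, here is a minimal sketch that derives a token's end position from its start position and its text. The `pos` type and the `end_pos` function are hypothetical names made up for this example, and the sketch assumes the stored token text is exactly what appeared in the source (which is where the extra complexity tends to come from, e.g. string literals whose escapes have already been decoded).

```ocaml
(* Hypothetical position type: 1-based line, 0-based column. *)
type pos = { line : int; col : int }

(* Position just past [text], given that [text] starts at [start].
   Assumes [text] is the token exactly as written in the source. *)
let end_pos (start : pos) (text : string) : pos =
  let step p c =
    if c = '\n' then { line = p.line + 1; col = 0 }  (* newline: next line *)
    else { p with col = p.col + 1 }                  (* otherwise: next column *)
  in
  Seq.fold_left step start (String.to_seq text)
```

For single-line tokens this is just the start column plus the token length; the newline case only matters for tokens that span lines.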

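Bison has some features which simplify the bookkeeping work of remembering location objects. ocamlyacc offers something similar through the standard Parsing module (for example Parsing.symbol_start_pos and Parsing.rhs_start_pos), and ocamllex exposes token positions through Lexing.lexeme_start_p and Lexing.lexeme_end_p, provided the lexer calls Lexing.new_line on each newline so that line numbers stay up to date. In any case, it is straightforward to maintain a location object associated with each input token.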

With that information, it is easy to recreate the whitespace between two adjacent tokens, as long as what separated the tokens was whitespace. Comments are another issue.
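A minimal sketch of that reconstruction, assuming a hypothetical `loc` record holding the four numbers and assuming that only spaces and newlines separate tokens (line and column alone cannot tell a tab from spaces, and comments would need separate handling):

```ocaml
(* Hypothetical location type: lines are 1-based, columns are 0-based,
   and the end column is exclusive. *)
type loc = {
  start_line : int;
  start_col : int;
  end_line : int;
  end_col : int;
}

(* Whitespace between a token ending at [prev] and one starting at [next].
   Assumes nothing but spaces and newlines lies between them. *)
let gap (prev : loc) (next : loc) : string =
  if next.start_line = prev.end_line then
    String.make (next.start_col - prev.end_col) ' '
  else
    String.make (next.start_line - prev.end_line) '\n'
    ^ String.make next.start_col ' '

(* Print located tokens in order, starting from line 1, column 0. *)
let render (tokens : (string * loc) list) : string =
  let buf = Buffer.create 64 in
  let origin = { start_line = 1; start_col = 0; end_line = 1; end_col = 0 } in
  ignore
    (List.fold_left
       (fun prev (text, l) ->
         Buffer.add_string buf (gap prev l);
         Buffer.add_string buf text;
         l)
       origin tokens);
  Buffer.contents buf
```

If the locations also carry byte offsets into the source, slicing the original text between two offsets recovers the separator exactly, tabs included.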

It's a judgement call whether or not that is simpler than just attaching preceding whitespace (and even comments) to each token as it is lexed.
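For comparison, the second approach could look roughly like the sketch below. It is a toy hand-written scanner rather than an ocamllex rule, and the `token` record, `tokenize` and `print_tokens` are hypothetical names for illustration only. A token here is simply a maximal run of non-blank characters, and comments are not handled.

```ocaml
(* Hypothetical token wrapper: [leading] is the whitespace seen before [text]. *)
type token = { leading : string; text : string }

let is_blank c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

(* Toy scanner. The last token always has empty [text] and plays the role
   of EOF, so trailing whitespace is preserved as its [leading] field. *)
let tokenize (src : string) : token list =
  let n = String.length src in
  (* Advance from index [i] while predicate [p] holds. *)
  let rec span p i = if i < n && p src.[i] then span p (i + 1) else i in
  let rec go i acc =
    let j = span is_blank i in                       (* end of whitespace *)
    let k = span (fun c -> not (is_blank c)) j in    (* end of token text *)
    let tok =
      { leading = String.sub src i (j - i); text = String.sub src j (k - j) }
    in
    if k = j then List.rev (tok :: acc) else go k (tok :: acc)
  in
  go 0 []

(* Printing each token after its own leading whitespace gives back the input. *)
let print_tokens (toks : token list) : string =
  String.concat "" (List.map (fun t -> t.leading ^ t.text) toks)
```

In an ocamllex lexer the equivalent change would be to accumulate the blanks in a buffer instead of discarding them, and hand the buffer's contents over with the next token. The AST then only needs to keep the tokens (or their whitespace) at its leaves rather than a count on every node type.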