frontend lex code-formatting pretty-print ocamllex

Common way to store positions in a formatter

I want to write a small editor for a specific language. In the editor, we will be able to indent one or several lines (ie, adding white-spaces on the left hand of each line); we will also be able to format the whole code (ie, alter white-spaces and newlines in appropriate locations).

Given a program, my front-end by ocamllex and ocamlyacc could build a Abstract Syntax Tree (AST). I would like to know what are common ways to store positions of elements in AST.

One way I guess is to append a (start) position to each element of the AST. For example, if the type of an expression is defined as follow:

type expression =
  ...
  | E_int of int
  | E_function_EEs of Function.t * (expression list)

It will become:

type expression =
  ...
  | E_int of position * int
  | E_function_EEs of position * Function.t * (expression list)

Then, if we know the length of each element, we can infer the position of everything in the editor. Is it a common way to do so? I don't find it nice...

Solution

You don't have to repeat position for each pattern. You could just write one in the end:

type expression = 
  ...
  | E_int of int
  | E_function_EEs of Function.t * (expression list)
  | E_loc of position * expression

As a result, for existing functions over expression, you could just add a case for E_loc, and do not need to touch existing cases.

To construct E_loc automatically while parsing, you could add in .mly for example:

loc(EXPRESSION):
| t = EXPRESSION { E_loc (($startpos, $endpos), t) }

(* immediate construction: *)
expression:
| INT { E_loc (($startpos, $endpos), E_int $1) }

(* or delay construction: *)
expression:
| e0 = loc(expression) PLUS e1 = loc(expression) { E_function_EEs (Function.PLUS, [e0; e1]) }