error-handling ocaml abstract-syntax-tree menhir

Associate AST nodes with token locations in source file in menhir

I am using Menhir to parse a DSL. My parser builds an AST using an elaborate collection of nested types. During later typecheck and other passes in error reports generated for a user, I would like to refer to source file position where it occurred. These are not parsing errors, and they generated after parsing is completed.

A naive solution would be to equip all AST types with additional location information, but that would make working with them (e.g. constructing or matching) unnecessary clumsy. What are the established practices to do that?

Solution

You need somehow to attach the location information to your nodes. The usual solution is to encode your AST node as a record, e.g.,

type node = 
  | Typedef of typdef
  | Typeexp of typeexp
  | Literal of string
  | Constant of int
  | ...

type annotated_node = { node : node; loc : loc}

Since you're using records, you can still pattern match without too much syntactic overhead, e.g.,

 match node with
 | {node=Typedef t} -> pp_typedef t
 | ...

Depending on your representation, you may choose between wrapping each branch of your type individually, wrapping the whole type, or recursively, like in Frama-C example by @Isabelle Newbie.

A similar but more general approach is to extend a node not with the location, but just with a unique identifier and to use a final map to add arbitrary data to nodes. The benefit of this approach is that you can extend your nodes with arbitrary data as you, technically, externalize node attributes. The drawback is that you can't guarantee the totality of an attribute since finite maps are inherently non-total. Thus, preserving an invariant that, for example, all nodes have a location is harder.

Since every heap-allocated object already has an implicit unique identifier, the address, it is possible to attach data to the heap-allocated objects without wrapping it in another type. For example, we can still use type node as it is and use finite maps to attach arbitrary pieces of information to them, as long as each node is a heap object, i.e., the node definition doesn't contain constant constructors (in case if it has, you can work around it by adding a bogus unit value, e.g., | End can be represented as | End of unit.

Of course, by saying an address, I do not mean the physical or virtual address of an object. OCaml uses a moving GC so the actual address of an OCaml object may change during a program execution. Moreover, an address, in general, is not unique, as once an object is deallocated its address can be grabbed by a completely different entity.

Fortunately, after ephemera were added to the recent version of OCaml it is no longer a problem. Moreover, an ephemeron will play nicely with the GC, so that if a node is no longer reachable its attributes (like file locations) will be collected by the GC. So, let's ground this with a concrete example. Suppose we have two nodes c1 and c2:

let c1 = Literal "hello"
let c2 = Constant 42

Now we can create a location mapping from nodes to locations (we will represent the latter as just strings)

module Locations = Ephemeron.K1.Make(struct
   type t = node
   let hash = Hashtbl.hash (* or your own hash if you have one *)
   let equal = (==) (* aka the physical equality *)
end)

The Locations module provides an interface of a typical imperative hash table. So let's use it. In the parser, whenever you create a new node you should register its locations in the global locations value, e.g.,

let locations = Locations.create 1337

(* let's assume semantics actions produced different constants `c1` and `c2` which are structurally equal, e.g., *)

# let c1 = Constant 42 and c2 = Constant 42 

(* with each constant we can associate its location *)

Locations.add c1 "hello.ml:12:32"
Locations.add c2 "hello.ml:13:56"

And later, you can extract the location:

# Locations.find locs c1;;
- : string = "hello.ml:12:32"

# Locations.find locs c2;;
- : string = "hello.ml:13:56"

As you see, although the solution is nice in the sense, that it doesn't touch the node data type, so the rest of your code can pattern match on it nicely and easily, it is still a little bit dirty, as it requires global mutable state, that is hard to maintain. Also, since we are using an object address as a key, every newly created object, even if it was logically derived from the original object, will have a different identity. For example, suppose you have a function, that normalizes all literals:

let normalize = function
  | Literal str -> Literal (normalize_literal str)
  | node -> node

It will create a new Literal node from the original nodes, so all the location information will be lost. That means, that you need to update the location information, every time you derive one node from another.

Another issue with ephemera is that they can't survive the marshaling or serialization. I.e., if you store your AST somewhere in a file, and then you restore it, all nodes will loose their identity, and the location table will become empty.

Concerning the "monadic approach" that you mentioned in the comments. Though monads are magic, they still can't magically solve all the problems. They are not silver bullets :) To attach something to a node we still need to extend it with an extra attribute - either a location information directly or an identity through which we can attach properties indirectly. The monad can be useful for the latter though, as instead of having a global reference to the last assigned identifier, we can use a state monad, to encapsulate our id generator. And for the sake of completeness, instead of using a state monad or a global reference to generate unique identifiers, you can use UUID and get identifiers that are not unique in a program run but are also universally unique, in the sense that there are no other objects in the world with the same identifier, no matter how often you run your program (in the sane world). Although it looks like generating the UUID doesn't use any state, underneath the hood it still uses an imperative random number generator, so it is sort of cheating, but still can seen as purely functional, as it doesn't contain observable effects.