comments, ocaml, camlp4, ppx

Lexer/filter for comments


Is there an OCaml tool that allows filtering comments in source files, similar to gcc -E?

Ideally, I'm looking for something that will remove everything but comments, but the other way around would also be useful.

For instance, if there is a way to use camlp4/camlp5/ppx to obtain OCaml comments (including non-OCamldoc comments, which start with a single asterisk), I would like to know. I haven't had much success looking for comment nodes in Camlp4's AST (though I know they must exist, because there are even bugs related to the fact that Camlp4 modifies their placement).

Here's an example. Given the following file:

(*** three asterisks *)
let f () =
  Format.printf "end"

let () =
  (* one asterisk (* nested comment *) *)
  Printf.printf "hello world\n";
  (** two asterisks *)
  f();
  ()

Ideally, I'd like to obtain:

(*** three asterisks *)
(* one asterisk (* nested comment *) *)
(** two asterisks *)

The whitespace between them and the presence or absence of the (* *) delimiters are mostly irrelevant, but the output should preserve comments of all kinds. My immediate purpose is to feed the comments to a spell checker, but the inverse (a filter that strips only the comments) could also be useful: I could strip the comments and then use diff against the original file to obtain what was removed.


Solution

  • There is now a tool called ocaml-comment-sieve that strips everything but the comments from the code. It is based on the simple lexer used in ocamlwc.

    However, this tool is GPL-licensed (because it is derived from ocamlwc, which is GPL-licensed), so it cannot be posted here. Still, it does satisfy my requirements, so until someone suggests a better way, I'll consider it as an answer.
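For readers who only want to see the general idea, here is a minimal hand-written sketch of such a comment filter. It is not the code of ocaml-comment-sieve or ocamlwc, and the function name `extract_comments` is made up for this illustration. It assumes the whole file fits in a string, tracks comment nesting with a depth counter, and skips string literals so that a `(*` inside a string is not mistaken for a comment. It deliberately ignores quoted strings (`{|...|}`) and character literals other than the double-quote ones, so a real lexer will be more robust.

```ocaml
(* Return the top-level comments of [src] in source order.
   Nested comments stay inside the outer comment that contains them. *)
let extract_comments (src : string) : string list =
  let n = String.length src in
  let buf = Buffer.create 256 in
  let comments = ref [] in
  (* Does the text [s] occur in [src] at position [i]? *)
  let starts_with i s =
    let l = String.length s in
    i + l <= n && String.sub src i l = s
  in
  (* [i] is just past an opening double quote; return the index just past
     the closing one, or [n] if the string is unterminated. *)
  let rec skip_string i =
    if i >= n then n
    else
      match src.[i] with
      | '\\' -> skip_string (i + 2)
      | '"' -> i + 1
      | _ -> skip_string (i + 1)
  in
  (* Scan ordinary code, looking for the start of a comment. *)
  let rec code i =
    if i >= n then ()
    else if starts_with i "(*" then begin
      Buffer.clear buf;
      Buffer.add_string buf "(*";
      comment 1 (i + 2)
    end
    (* The two character literals that contain a double quote. *)
    else if starts_with i "'\"'" then code (i + 3)
    else if starts_with i "'\\\"'" then code (i + 4)
    else if src.[i] = '"' then code (skip_string (i + 1))
    else code (i + 1)
  (* Scan inside a comment nested [depth] levels deep, copying text to [buf]. *)
  and comment depth i =
    if i >= n then
      (* Unterminated comment: keep what was collected so far. *)
      comments := Buffer.contents buf :: !comments
    else if starts_with i "(*" then begin
      Buffer.add_string buf "(*";
      comment (depth + 1) (i + 2)
    end
    else if starts_with i "*)" then begin
      Buffer.add_string buf "*)";
      if depth = 1 then begin
        comments := Buffer.contents buf :: !comments;
        code (i + 2)
      end
      else comment (depth - 1) (i + 2)
    end
    else if starts_with i "'\"'" then begin
      Buffer.add_string buf "'\"'";
      comment depth (i + 3)
    end
    else if src.[i] = '"' then begin
      (* String literals are also recognised inside comments. *)
      let j = skip_string (i + 1) in
      Buffer.add_string buf (String.sub src i (j - i));
      comment depth j
    end
    else begin
      Buffer.add_char buf src.[i];
      comment depth (i + 1)
    end
  in
  code 0;
  List.rev !comments
```

Calling `List.iter print_endline (extract_comments contents)` on the contents of a file prints one comment per entry, which could then be piped to a spell checker. The inverse filter (strip comments, keep code) would follow the same structure, copying characters in `code` instead of in `comment`.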