
Construction of a lexer with ocamllex without a parser module


I want to build a lexer with ocamllex before writing a parser with menhir.

I have written the .mll file. The following command gave me this message:

> ocamllex lexer.mll
71 states, 2138 transitions, table size 8978 bytes

Then I typed the following one:

> ocamlc -c lexer.ml
File "lexer.mll", line 48, characters 12-17:
Error: Unbound constructor COMMA

An excerpt of the .mll file:

       ...
(* Lexing Rules *)
rule tokens = parse
  | ',' { COMMA }
  | ';' { SEMICOLON }
  | '(' { LPAREN }
  | ')' { RPAREN }
  | '[' { LBRACKETS }
  | ']' { RBRACKETS }
  | '{' { LBRACE }
  | '}' { RBRACE }
       ...

My understanding is that these actions are not mapped to anything, hence the unbound constructor error.
How can I define these mappings without writing the parser or the .mly file? I'm fairly new to the language, and all I want for now is a simple lexer built with ocamllex.


Solution

  • Each action is an OCaml expression that ocamllex copies into the generated lexer.ml, so when you say they aren't mapped to anything, what you're really saying is that you're using constructors that aren't defined anywhere.

    As @glennsl points out, you can make this work by defining the constructors. An .mll file can begin, before the rules, with a header section in curly braces that contains arbitrary OCaml code. You can define a type there like this:

    { type token = COMMA | ...
    }
    rule tokens = parse
    | ',' { COMMA }
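
To make the standalone stage concrete, here is a minimal sketch of a complete lexer.mll that compiles without any parser. The constructor names come from the excerpt in the question; the EOF token, the whitespace rule, the Lexing_error exception, and the tokenize helper are assumptions added for illustration only.

```ocaml
(* lexer.mll: standalone sketch, no parser module required. *)
{
  (* The token type lives in the header only because no parser exists yet. *)
  type token =
    | COMMA | SEMICOLON
    | LPAREN | RPAREN
    | LBRACKETS | RBRACKETS
    | LBRACE | RBRACE
    | EOF  (* assumption: not part of the question's excerpt *)

  (* Hypothetical exception for characters no rule matches. *)
  exception Lexing_error of string
}

rule tokens = parse
  | [' ' '\t' '\n'] { tokens lexbuf }  (* assumption: skip whitespace *)
  | ','  { COMMA }
  | ';'  { SEMICOLON }
  | '('  { LPAREN }
  | ')'  { RPAREN }
  | '['  { LBRACKETS }
  | ']'  { RBRACKETS }
  | '{'  { LBRACE }
  | '}'  { RBRACE }
  | eof  { EOF }
  | _ as c
    { raise (Lexing_error (Printf.sprintf "unexpected character %C" c)) }

{
  (* Hypothetical helper: collect every token of a string, EOF included. *)
  let tokenize (s : string) : token list =
    let lexbuf = Lexing.from_string s in
    let rec loop acc =
      match tokens lexbuf with
      | EOF -> List.rev (EOF :: acc)
      | t -> loop (t :: acc)
    in
    loop []
}
```

With a file like this, `ocamllex lexer.mll` followed by `ocamlc -c lexer.ml` should compile, because every constructor used in an action is now defined in the header.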
    

    However, the token type needs to be shared between the lexer and the parser. Hence tokens are usually defined in the parser file and the parser generator (menhir in your case) generates a file that defines the token type. So when you move to the next step of your project you'll probably want to remove the definition of the token type from your lexer file.
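
Once the parser exists, the lexer header might shrink to a single `open`, roughly as sketched below. This assumes the grammar lives in a file named parser.mly (so that menhir generates a module called Parser) and that it declares the same tokens with `%token`; both are assumptions, since the question does not show a grammar.

```ocaml
(* lexer.mll, later stage: the token type now comes from the parser. *)
{
  (* Assumption: parser.mly contains declarations such as
       %token COMMA SEMICOLON LPAREN RPAREN
     so the menhir-generated module Parser defines the type token. *)
  open Parser
}

rule tokens = parse
  | ',' { COMMA }
  | ';' { SEMICOLON }
  | '(' { LPAREN }
  | ')' { RPAREN }
```

The rules themselves stay unchanged; only the place where the constructors are defined moves from the lexer header to the generated parser module.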