I am writing a front-end to parse a set of txt files; each file contains a set of procedures. For instance, one txt file looks like:
Sub procedure1
...
End Sub
Sub procedure2
...
End Sub
...
syntax.ml
contains:
type ev = procedure_declaration list
and procedure_declaration =
  { procedure_name : string; procedure_body : procedure_body }
and procedure_body = ...
...
parser.mly
looks like:
%start main
%type <Syntax.ev> main
%%
main: procedure_declarations EOF { List.rev $1 }
procedure_declarations:
  /* empty */ { [] }
| procedure_declarations procedure_declaration { $2 :: $1 }
procedure_declaration:
  SUB name = procedure_name EOS
  body = procedure_body
  END SUB EOS
    { { procedure_name = name; procedure_body = body } }
...
Now, I would like to retrieve the parsing of procedure_declaration into its own parser (for the purpose of exception handling). That is, I want to create parser_pd.mly
and lexer_pd.mll
, and let parser.mly
call Parser_pd.main
. Therefore, parser_pd.mly
looks like:
%start main
%type <Syntax.procedure_declaration> main
%%
main: procedure_declaration EOF { $1 };
...
As most of the content of the previous parser.mly
will be moved into parser_pd.mly
, parser.mly
should now be much lighter than before and look like:
%start main
%type <Syntax.ev> main
%%
main: procedure_declarations EOF { List.rev $1 }
procedure_declarations:
  /* empty */ { [] }
| procedure_declarations procedure_declaration { $2 :: $1 }
procedure_declaration:
  SUB name = procedure_name EOS
  ??????
  END SUB EOS
    { { procedure_name = name;
        procedure_body = Parser_pd.main Lexer_pd.token ?????? } }
The question is that I don't know how to write the ??????
parts, nor lexer.mll
, which should be light (as it only reads the tokens END
, SUB
and EOS
, and leaves the contents to be handled by lexer_pd.mll
). Maybe some functions from the Lexing
module are needed?
Hope my question is clear... Could anyone help?
You write that you want to retrieve the parsing of procedure_declaration, but in your code, you only want to retrieve a procedure_body, so I'm assuming that's what you want.
To put it into my own words, you want to compose grammars without telling the embedding grammar which grammar is embedded. The problem with this (no problem in your case, because you luckily have a very friendly grammar) is that in LALR(1), you need one token of lookahead to decide which rule to take. Your grammar looks like this:
procedure_declaration:
  SUB procedure_name EOS
  procedure_body
  END SUB EOS
You can combine procedure_name and procedure_body, so your rule and semantic action will look like:
procedure_declaration:
  SUB combined = procedure_name EOS /* nothing here */ EOS
    { { procedure_name = fst combined; procedure_body = snd combined; } }
procedure_name:
  id = IDENT {
    let lexbuf = _menhir_env._menhir_lexbuf in
    (id, Parser_pd.main Lexer_pd.token lexbuf)
  }
Parser_pd will contain this rule:
main: procedure_body END SUB { $1 }
You will very likely want END SUB in Parser_pd, because procedure_body is likely not self-delimiting.
Note that you call the sub-parser before the outer parser consumes the first EOS after the procedure name identifier, because that EOS is your lookahead. If you call it in an action after the EOS, it is too late: the parser will already have pulled a token from the body as lookahead. The second EOS in the rule is the one after END SUB.
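The hand-off can be illustrated with a small, self-contained sketch (a hand-rolled token stream and parser functions; all names here are hypothetical, not the generated menhir code): the outer parser reads the SUB header, then delegates to a body parser that pulls from the same mutable stream and consumes everything up to and including END SUB, exactly like the proposed Parser_pd.main.

```ocaml
(* Hypothetical sketch: both "parsers" pull from the same mutable
   token stream, mimicking two menhir parsers sharing one lexbuf. *)
type token = SUB | END | EOS | IDENT of string

let tokens = ref [ SUB; IDENT "p1"; EOS; IDENT "x"; IDENT "y"; END; SUB; EOS ]

let next () =
  match !tokens with
  | [] -> failwith "unexpected end of input"
  | t :: rest -> tokens := rest; t

(* Sub-parser: consumes the body up to and including END SUB,
   like the proposed Parser_pd.main rule. *)
let rec parse_body acc =
  match next () with
  | END ->
      (match next () with
       | SUB -> List.rev acc
       | _ -> failwith "expected SUB after END")
  | IDENT s -> parse_body (s :: acc)
  | _ -> failwith "unexpected token in body"

(* Outer parser: reads SUB name EOS, delegates the body to the
   sub-parser, then expects the final EOS after END SUB. *)
let parse_decl () =
  let t1 = next () in
  let t2 = next () in
  let t3 = next () in
  match t1, t2, t3 with
  | SUB, IDENT name, EOS ->
      let body = parse_body [] in
      (match next () with
       | EOS -> (name, body)
       | _ -> failwith "expected EOS after END SUB")
  | _ -> failwith "expected SUB name EOS"
```

Calling parse_decl () on the stream above returns ("p1", ["x"; "y"]): the body parser stops after END SUB, and the outer parser resumes with the final EOS, which is the same resumption point you get with a shared lexbuf.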
The _menhir_env
thing is obviously a hack that only works with menhir.
You may need another hack to make menhir --infer
work (if you use that), because --infer
does not expect a user to refer to _menhir_env, so the symbol won't be
in scope during type inference. That hack would be:
%{
  type menhir_env_hack = { _menhir_lexbuf : Lexing.lexbuf }
  let _menhir_env =
    { _menhir_lexbuf =
        Lexing.from_function
          (* Make sure this lexbuf is never actually used. *)
          (fun _ _ -> assert false) }
%}