I am writing a front-end to parse a set of txt files; each file contains a set of procedures. For instance, one txt file looks like:
Sub procedure1
...
End Sub
Sub procedure2
...
End Sub
...
syntax.ml
contains:
type ev = procedure_declaration list
and procedure_declaration =
  { procedure_name : string; procedure_body : procedure_body }
and procedure_body = ...
...
parser.mly
looks like:
%start main
%type <Syntax.ev> main
%%
main: procedure_declarations EOF { List.rev $1 }
procedure_declarations:
  /* empty */ { [] }
| procedure_declarations procedure_declaration { $2 :: $1 }
procedure_declaration:
  SUB name = procedure_name EOS
  body = procedure_body
  END SUB EOS
    { { procedure_name = name; procedure_body = body } }
...
Now, I would like to retrieve the parsing of procedure_declaration into its own parser (for the purpose of exception handling). That is, I want to create parser_pd.mly
and lexer_pd.mll
, and let parser.mly
call Parser_pd.main
. Therefore, parser_pd.mly
looks like:
%start main
%type <Syntax.procedure_declaration> main
%%
main: procedure_declaration EOF { $1 };
...
As most of the content of the previous parser.mly
will be moved into parser_pd.mly
, parser.mly
should now be much lighter than before and look like:
%start main
%type <Syntax.ev> main
%%
main: procedure_declarations EOF { List.rev $1 }
procedure_declarations:
  /* empty */ { [] }
| procedure_declarations procedure_declaration { $2 :: $1 }
procedure_declaration:
  SUB name = procedure_name EOS
  ??????
  END SUB EOS
    { { procedure_name = name;
        procedure_body = Parser_pd.main Lexer_pd.token ?????? } }
The question is that I don't know how to write the ??????
parts, nor lexer.mll
, which should be light (as it only reads the tokens END
, SUB
and EOS
, and leaves the contents to be handled by lexer_pd.mll
). Maybe some functions from the Lexing
module are needed?
Hope my question is clear... Could anyone help?
You write that you want to retrieve the parsing of procedure_declaration, but in your code, you only want to retrieve a procedure_body, so I'm assuming that's what you want.
To put it into my own words, you want to compose grammars without telling the embedding grammar which grammar is embedded. The problem with this (no problem in your case, because you luckily have a very friendly grammar) is that in LALR(1), you need one token of lookahead to decide which rule to take. Your grammar looks like this:
procedure_declaration:
  SUB procedure_name EOS
  procedure_body
  END SUB EOS
You can combine procedure_name and procedure_body, so your rule and semantic action will look like:
procedure_declaration:
  SUB combined = procedure_name EOS /* nothing here */ EOS
    { { procedure_name = fst combined; procedure_body = snd combined; } }
procedure_name:
  id = IDENT {
    let lexbuf = _menhir_env._menhir_lexbuf in
    (id, Parser_pd.main Lexer_pd.token lexbuf)
  }
Parser_pd will contain this rule:
main: procedure_body END SUB { $1 }
You will very likely want END SUB in Parser_pd, because procedure_body is likely not self-delimiting.
Note that you call the sub-parser before the outer parser consumes the first EOS after the procedure name identifier, because that EOS is your lookahead. If you call it in an action after the EOS, it is too late: the parser will already have pulled a token from the body as lookahead. The second EOS in the rule is the one after END SUB.
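The hand-off can be illustrated with a small, self-contained sketch (a hand-rolled token stream and parser functions; all names here are hypothetical, not the generated menhir code): the outer parser reads the SUB header, then delegates to a body parser that pulls from the same mutable stream and consumes everything up to and including END SUB, exactly like the proposed Parser_pd.main.

```ocaml
(* Hypothetical sketch: both "parsers" pull from the same mutable
   token stream, mimicking two menhir parsers sharing one lexbuf. *)
type token = SUB | END | EOS | IDENT of string

let tokens = ref [ SUB; IDENT "p1"; EOS; IDENT "x"; IDENT "y"; END; SUB; EOS ]

let next () =
  match !tokens with
  | [] -> failwith "unexpected end of input"
  | t :: rest -> tokens := rest; t

(* Sub-parser: consumes the body up to and including END SUB,
   like the proposed Parser_pd.main rule. *)
let rec parse_body acc =
  match next () with
  | END ->
      (match next () with
       | SUB -> List.rev acc
       | _ -> failwith "expected SUB after END")
  | IDENT s -> parse_body (s :: acc)
  | _ -> failwith "unexpected token in body"

(* Outer parser: reads SUB name EOS, delegates the body to the
   sub-parser, then expects the final EOS after END SUB. *)
let parse_decl () =
  let t1 = next () in
  let t2 = next () in
  let t3 = next () in
  match t1, t2, t3 with
  | SUB, IDENT name, EOS ->
      let body = parse_body [] in
      (match next () with
       | EOS -> (name, body)
       | _ -> failwith "expected EOS after END SUB")
  | _ -> failwith "expected SUB name EOS"
```

Calling parse_decl () on the stream above returns ("p1", ["x"; "y"]): the body parser stops after END SUB, and the outer parser resumes with the final EOS, which is the same resumption point you get with a shared lexbuf.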
The _menhir_env
thing is obviously a hack that only works with menhir.
You may need another hack to make menhir --infer
work (if you use that), because --infer
does not expect a user to refer to _menhir_env, so the symbol won't be
in scope during type inference. That hack would be:
%{
  type menhir_env_hack = { _menhir_lexbuf : Lexing.lexbuf }
  let _menhir_env =
    { _menhir_lexbuf =
        Lexing.from_function
          (* Make sure this lexbuf is never actually used. *)
          (fun _ _ -> assert false) }
%}