parsing ocaml lexical-analysis ambiguity

Pattern Matching in dypgen

I want handle some ambiguities in dypgen. I found something in the manual, that I want to know, how I can use that. In the manual point 5.2 "Pattern matching on Symbols" there is an example:

expr:
| expr OP<"+"> expr { $1 + $2 }
| expr OP<"*"> expr { $1 * $2 }

OP is matched with "+" or "*", as I understand. I also find there:

The patterns can be any Caml patterns (but without the keyword when). For instance this is possible:
expr: expr<(Function([arg1;arg2],f_body)) as f> expr
{ some action }

So I tried to put there some other expressions, but I dont understand, what happens. If I put in there printf it outputs the value of the matched string. But if I put in there (fun x -> printf x), that seems to me the same as printf, dypgen complains about a syntax error and points to the end of the expression. If I put Printf.printf in there, it complains about Syntax error: operator expected. And if I put there (fun x -> Printf.printf x) it says: Lexing failed with message: lexing: empty token What do these different error-messages mean?

In the end I would like to look up something in a hashtable, if the value is in there, but I don't know, if it is possible this way. Is it or isn't it possible?

EDIT: A minimal example derived from the forest-example from the dypgen-demos.

The grammarfile forest_parser.dyp contains:

{
open Parse_tree
let dyp_merge = Dyp.keep_all
}

%start main
%layout [' ' '\t']

%%

main : np "." "\n" { $1 }

np:
  |    sg                   {Noun($1)}
  |    pl                   {Noun($1)}

sg: word    <Word("sheep"|"fish")>  {Sg($1)}
sg: word    <Word("cat"|"dog")>  {Sg($1)}
pl: word    <Word("sheep"|"fish")>  {Pl($1)}
pl: word    <Word("cats"|"dogs")>  {Pl($1)}

/* OR try:
    sg: word    <printf>  {Sg($1)}
    pl: word    <printf>  {Pl($1)}
*/

word: 
  | (['A'-'Z' 'a'-'z']+)    {Word($1)}

The forest.ml has the following print_forest-function now:

let print_forest forest =
  let rec aux1 t = match t with
    | Word x
    -> print_string x
    | Noun (x) -> (
        print_string "N [";
        aux1 x;
        print_string " ]")
    | Sg (x) -> (
        print_string "Sg [";
        aux1 x;
        print_string " ]")
    | Pl (x) -> (
        print_string "Pl [";
        aux1 x;
        print_string " ]")
  in
  let aux2 t = aux1 t; print_newline () in
  List.iter aux2 forest;
  print_newline ()

And the parser_tree.mli contains:

type tree = 
  | Word        of string
  | Noun        of tree
  | Sg          of tree
  | Pl          of tree

And then you can determine, what numeri fish, sheep, cat(s) etc. are.

sheep or fish can be singular and plural. cats and dogs cannot.

fish.
N [Sg [fish ] ]
N [Pl [fish ] ]

Solution

I know nothing about Dypgen so I tried to figure it out.

Let's see what I found out.

In the parser.dyp file you can define the lexer and the parser or you can use an external lexer. Here's what I did :

My ast looks like this :

parse_prog.mli

type f = 
  | Print of string
  | Function of string list * string * string

type program = f list

prog_parser.dyp

{
  open Parse_prog

  (* let dyp_merge = Dyp.keep_all *)    

  let string_buf = Buffer.create 10
}

%start main

%relation pf<pr

%lexer

let newline = '\n'
let space = [' ' '\t' '\r']
let uident = ['A'-'Z']['a'-'z' 'A'-'Z' '0'-'9' '_']*
let lident = ['a'-'z']['a'-'z' 'A'-'Z' '0'-'9' '_']*

rule string = parse
  | '"' { () }
  | _ { Buffer.add_string string_buf (Dyp.lexeme lexbuf);
      string lexbuf }

main lexer =
  newline | space + -> { () }
  "fun"  -> ANONYMFUNCTION { () }
  lident -> FUNCTION { Dyp.lexeme lexbuf }
  uident -> MODULE { Dyp.lexeme lexbuf }
  '"' -> STRING { Buffer.clear string_buf;
                  string lexbuf;
                  Buffer.contents string_buf }

%parser

main : function_calls eof                                          
   { $1 }

function_calls:
  |                                                                
    { [] }
  | function_call ";" function_calls                               
    { $1 :: $3 }

function_call:
  | printf STRING                                                  
    { Print $2 } pr
  | "(" ANONYMFUNCTION lident "->" printf lident ")" STRING        
    { Print $6 } pf
  | nested_modules "." FUNCTION STRING                             
    { Function ($1, $3, $4) } pf
  | FUNCTION STRING                                                
    { Function ([], $1, $2) } pf
  | "(" ANONYMFUNCTION lident "->" FUNCTION lident ")" STRING      
    { Function ([], $5, $8) } pf

printf:
  | FUNCTION<"printf">                                             
    { () }
  | MODULE<"Printf"> "." FUNCTION<"printf">                        
    { () }

nested_modules:
  | MODULE                                       
    { [$1] }
  | MODULE "." nested_modules                    
    { $1 :: $3 }

This file is the most important. As you can see, if I have a function printf "Test" my grammar is ambiguous and this can be reduced to either Print "Test" or Function ([], "printf", "Test") but !, as I realized, I can give priorities to my rules so if one as a higher priority it will be the one chosen for the first parsing. (try to uncomment let dyp_merge = Dyp.keep_all and you'll see all the possible combinations).

And in my main :

main.ml

open Parse_prog

let print_stlist fmt sl =
  match sl with 
    | [] -> ()
    | _ -> List.iter (Format.fprintf fmt "%s.") sl

let print_program tl =
  let aux1 t = match t with
      | Function (ml, f, p) -> 
        Format.printf "I can't do anything with %a%s(\"%s\")@." print_stlist ml f p
      | Print s -> Format.printf "You want to print : %s@." s
  in
  let aux2 t = List.iter (fun (tl, _) -> 
     List.iter aux1 tl; Format.eprintf "------------@.") tl in
  List.iter aux2 tl

let input_file = Sys.argv.(1)

let lexbuf = Dyp.from_channel (Forest_parser.pp ()) (Pervasives.open_in input_file)

let result = Parser_prog.main lexbuf

let () = print_program result

And, for example, for the following file :

test

printf "first print";
Printf.printf "nested print";
Format.eprintf "nothing possible";
(fun x -> printf x) "Anonymous print";

If I execute ./myexec test I will get the following prompt

You want to print : first print
You want to print : nested print
I can't do anything with Format.eprintf("nothing possible")
You want to print : x
------------

So, TL;DR, the manual example was just here to show you that you can play with your defined tokens (I never defined the token PRINT, just FUNCTION) and match on them to get new rules.

I hope it's clear, I learned a lot with your question ;-)

[EDIT] So, I changed the parser to match what you wanted to watch :

{
      open Parse_prog

      (* let dyp_merge = Dyp.keep_all *)

      let string_buf = Buffer.create 10
    }

    %start main

    %relation pf<pp

    %lexer

    let newline = '\n'
    let space = [' ' '\t' '\r']
    let uident = ['A'-'Z']['a'-'z' 'A'-'Z' '0'-'9' '_']*
    let lident = ['a'-'z']['a'-'z' 'A'-'Z' '0'-'9' '_']*

    rule string = parse
      | '"' { () }
      | _ { Buffer.add_string string_buf (Dyp.lexeme lexbuf);
          string lexbuf }

    main lexer =
      newline | space + -> { () }
      "fun"  -> ANONYMFUNCTION { () }
      lident -> FUNCTION { Dyp.lexeme lexbuf }
      uident -> MODULE { Dyp.lexeme lexbuf }
      '"' -> STRING { Buffer.clear string_buf;
                      string lexbuf;
                      Buffer.contents string_buf }

    %parser

    main : function_calls eof                                          
       { $1 }

    function_calls:
      |                                                                
        { [] } pf
      | function_call <Function((["Printf"] | []), "printf", st)> ";" function_calls
        { (Print st) :: $3 } pp
      | function_call ";" function_calls                               
        { $1 :: $3 } pf


    function_call:
      | nested_modules "." FUNCTION STRING                          
        { Function ($1, $3, $4) }
      | FUNCTION STRING                             
        { Function ([], $1, $2) }
      | "(" ANONYMFUNCTION lident "->" FUNCTION lident ")" STRING
        { Function ([], $5, $8) }

    nested_modules:
      | MODULE                                       
        { [$1] }
      | MODULE "." nested_modules                    
        { $1 :: $3 }

Here, as you can see, I don't handle the fact that my function is print when I parse it but when I put it in my functions list. So, I match on the algebraic type that was built by my parser. I hope this example is ok for you ;-) (but be warned, this is extremely ambiguous ! :-D)