Search code examples
compiler-constructionocamlocamllexmenhir

How to properly tokenize over whitespaces with ocamllex/Menhir?


I am writing a simple shell in OCaml. My grammar looks a little like this at the moment

/*
  program: command*
  command: WORD WORD* redirection*
  redirection: NUMBER? > FILENAME | < FILENAME
*/

Simple enough. I have a lexer file that looks like this:

let whitespace = [' ' '\t']+
let word = [^ '<' '>']+
let filename = [^ '\x00']+
let number = ['0' - '9']+
let newline = '\n' | "\r\n"

rule token = parse
| whitespace | newline { token lexbuf }
| number as lxm { NUMBER(int_of_string lxm) }
| word as lxm { WORD lxm }
| filename as lxm { FILENAME lxm }
| eof { EOF } 
| _ as lxm { raise @@ SyntaxError("Unexpected char" ^ (String.make 1 lxm)) }

And a Menhir parser (likely not the problem but for completeness)

%{
    open Ast
%}

%token <int> NUMBER
%token <string> WORD
%token <string> FILENAME
%token LEFTARROW
%token RIGHTARROW
%token EOF
%start program

%type <Ast.redirection> redirection
%type <Ast.command> command
%type <Ast.program> program

%%

redirection:
| LEFTARROW f = FILENAME { 0, f }
| opt = option(NUMBER) RIGHTARROW f = FILENAME { 
    match opt with
    | None -> 1, f
    | Some n -> n, f
 }

command:
| executable = WORD args = list(WORD) redirections = list(redirection) { {executable; args; redirections} }

program:
| commands = list(command) EOF { commands }

I have defined a utility function parse_string that looks like this

let parse_string program_string =
  let lexbuf = Lexing.from_string program_string in
  Parser.program Lexer.token lexbuf

For completeness, here is my AST

type redirection = int * string
  
type command = {
  executable : string;
  args : string list;
  redirections : redirection list;
}

type program = command list

Now, given the input string "ls -l" I would expect the result to be {executable = "ls"; args = ["-l"]; redirections = []}

However I get a single executable string, "ls -l"

My lexer is not correctly tokenizing my input into separate strings. It is clumping my input into a single WORD definition.

So my question is, using ocamllex (with Menhir as a parser) how can I enforce strings to tokenize on whitespace?

EDIT: Since I wanted to get something working quickly I followed the advice below and changed my word rule to [^ '<' '>' ' ' '\t'] to avoid whitespace. I also ran into a problem where a string "foo bar" was not being tokenized into 2 words. The problem was my filename rule; it turns out that ocamllex does not apply the rules in the order that the match specifies but always uses the longest matching rule. So "foo bar" was parsed into a single filename. I've removed the filname rule completely for now, the right thing to do is probably to only allow spaces in file names if they are quoted strings.


Solution

  • When the lexer encounters "ls -l", it matches the word category which is defined by a regex that greedily reads everything until the end, because the space character is part of the language defined by the word regex. There are rules to disambiguate multiple matches with the same prefix, but you are not in a situation where they can help you.

    I see two solutions: you make the rule for word reject whitespace, the same way you reject < and >, which is still a bit too universal maybe (you'd have to add exceptions as your syntax gets more complex), or have the Parser be responsible for aggregating letter token into commands. This plays nicely with the existing rules (redirection, whitespace) assuming that letters have lower priorities than special characters in your grammar.