Search code examples
regexparsingsplittokenizelexer

Tokenizing a string that could be a tuple or something else


I'm building a lexer (tokenizer) and a parser for a basic programming language, and one of the language features I was thinking about was the option to have a string parsed into a Tuple or an Expression. The problem is I am stuck on how I can determine from a string whether it should be a tuple or not. At first, I thought I could just check if the string starts with '(' and ends with ')' as well as contains a ',' somewhere inside, but the issue with that approach is that Expressions can also contain tuples.

Is there a canonical or accepted way to tell whether a string is a tuple?


Solution

  • The lexer doesn't figure out whether a ( is part of a tuple. It just recognises the (. That's all it has to do.

    Deciding what a particular symbol means is the job of the parser. The parser will use a grammatical description of the language to distinguish between tuples, parenthesised expressions, parameter lists, and all the other possible meanings of parentheses.

    The precise grammar will, of course, depend on the syntax of the language, but a simple example might be: (loosely adapted from Python, but missing lots of syntax)

    expr  : term             /* Additive operators, lowest precedence */
          | expr '+' term
          | expr '-' term
    term  : factor           /* Multiplicative operators */
          | term '*' factor
          | term '/' factor
    factor: postfix          /* Unary prefix operators */
          | '-' factor
          | '+' factor
    post  : unit             /* Postfix operators: call and subscript */
          | post '(' opt_expr_list ')'
          | postfix '[' expr ']'
    unit  : CONSTANT         /* Values */
          | IDENTIFIER
          | tuple
          | '(' expr ')'     /* Parenthesised expression */
    tuple : '(' ')'          /* Empty tuple */
          | '(' expr ',' ')' /* Single-element tuple eg. (1, ) */
          | '(' expr ',' expr_list opt_comma ')'
                             /* Two or more elements */
    opt_comma
          : %empty
          | ','
    expr_list
          : expr
          | expr_list ',' expr
    opt_expr_list
          : %empty
          | expr_list
    

    Note that the syntax has been carefully crafted to allow tuples to be distinguished from parenthesized expressions. The convention that is used is that tuples may be written with a , after the last element unless the tuple has exactly one element, in which case the comma the mandatory. This slightly requires breaking the syntax for tuples into three productions, but it is not particular complicated.

    Note also that it wasn't necessary to jump through this particular hoop when describing the grammar for function calls. It is impossible to confuse the (3) in sin(3) with a parenthesised expression, since parenthesised expressions (like other values) cannot immediately follow an expression without some sort of operator.