
Tokenizing an s-expression in C


I'm trying to create my own Lisp interpreter and have run into some issues with the parsing of s-expressions. My initial thought was to tokenize the expression and handle one token at a time. After failing with my own attempts, I came across some code to do this; however, I am confused by its output.

int lex(const char *str, const char **start, const char **end)
{
    const char *ws = " \t\r\n";
    const char *delim = "() \t\r\n";
    const char *prefix = "()'`";

    str += strspn(str, ws);

    if (str[0] == '\0') {
        *start = *end = NULL;
        return 1;
    }

    *start = str;

    if (strchr(prefix, str[0]) != NULL)
        *end = *start + 1;
    else
        *end = *start + strcspn(str, delim);

    return 0;
}

Usage:

const char *input = "(foo bar 17 '(a b c) 2)";

const char *token;
const char *p = input;

lex(p, &token, &p);

while(token != NULL)
{
    printf("%.*s\n", (int)(p - input), token);
    lex(p, &token, &p);
}

Output:

(
foo 
bar 17 '
17 '(a b c)
'(a b c) 2)
(a b c) 2)
a b c) 2)
b c) 2)
c) 2)
) 2)
2)
)

Looking at the code, I had expected it, for example, to output 17 and not 17 '(a b c) or to output 2 and not 2). What is causing this and how can I fix it? I'm also open to advice if tokenization isn't the best solution in this case.

On a second note, is a parameter like str absolutely necessary? Would the start and end parameters not be sufficient, since no data prior to start is needed?


Solution

  • Simple typo.

     printf("%.*s\n", (int)(p - input), token);
    

    Should be

     printf("%.*s\n", (int)(p - token), token);
    

    str is an input argument and start and end are output arguments. You could make start an inout argument, but not everyone likes those.
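    For illustration, an inout variant might look like the sketch below. The name `lex2` and its parameter names are my own; the tokenizing logic is unchanged, but the scan position and the token start now share one parameter.

    ```c
    #include <string.h>

    /* Sketch of an inout version: *sp supplies the scan position on
       entry and receives the token start on return. The caller
       advances by copying *end back into the pointer it passed. */
    int lex2(const char **sp, const char **end)
    {
        const char *ws = " \t\r\n";
        const char *delim = "() \t\r\n";
        const char *prefix = "()'`";
        const char *str = *sp + strspn(*sp, ws);

        if (str[0] == '\0') {
            *sp = *end = NULL;
            return 1;
        }

        *sp = str;
        if (strchr(prefix, str[0]) != NULL)
            *end = str + 1;
        else
            *end = str + strcspn(str, delim);
        return 0;
    }
    ```

    The caller then loops with something like `while (lex2(&start, &end) == 0) { ...; start = end; }`. Whether this is nicer than separate in and out parameters is a matter of taste, as noted above.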

    In any case, the returned token starts at start and its length is end - start, which is why the printf length argument needs to be p - token.
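    Putting the fix together with const-correct declarations in the caller, the complete program below should print one token per line. This is the question's own code with only the printf length expression changed.

    ```c
    #include <stdio.h>
    #include <string.h>

    int lex(const char *str, const char **start, const char **end)
    {
        const char *ws = " \t\r\n";
        const char *delim = "() \t\r\n";
        const char *prefix = "()'`";

        str += strspn(str, ws);          /* skip leading whitespace */

        if (str[0] == '\0') {            /* end of input: no token */
            *start = *end = NULL;
            return 1;
        }

        *start = str;

        if (strchr(prefix, str[0]) != NULL)
            *end = *start + 1;           /* single-character token */
        else
            *end = *start + strcspn(str, delim);

        return 0;
    }

    int main(void)
    {
        const char *input = "(foo bar 17 '(a b c) 2)";
        const char *token;
        const char *p = input;

        lex(p, &token, &p);
        while (token != NULL) {
            /* length is end - start, i.e. p - token */
            printf("%.*s\n", (int)(p - token), token);
            lex(p, &token, &p);
        }
        return 0;
    }
    ```

    With the fix, the tokens come out one at a time: `(`, `foo`, `bar`, `17`, `'`, `(`, `a`, `b`, `c`, `)`, `2`, `)` — each on its own line, instead of the ever-shrinking suffixes in the question's output.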