I'm trying to create my own Lisp interpreter and have run into some issues with parsing s-expressions. My initial thought was to tokenize the expression and handle one token at a time. After failing with my own attempts, I came across some code that does this, but I am confused by its output.
int lex(const char *str, const char **start, const char **end)
{
    const char *ws = " \t\r\n";
    const char *delim = "() \t\r\n";
    const char *prefix = "()'`";

    str += strspn(str, ws);
    if (str[0] == '\0') {
        *start = *end = NULL;
        return 1;
    }
    *start = str;
    if (strchr(prefix, str[0]) != NULL)
        *end = *start + 1;
    else
        *end = *start + strcspn(str, delim);
    return 0;
}
Usage:
const char *input = "(foo bar 17 '(a b c) 2)";
const char *token;
const char *p = input;

lex(p, &token, &p);
while (token != NULL)
{
    printf("%.*s\n", (int)(p - input), token);
    lex(p, &token, &p);
}

Output:
(
foo
bar 17 '
17 '(a b c)
'(a b c) 2)
(a b c) 2)
a b c) 2)
b c) 2)
c) 2)
) 2)
2)
)
Looking at the code, I had expected it, for example, to output 17 and not 17 '(a b c), or to output 2 and not 2). What is causing this and how can I fix it? I'm also open to advice if tokenization isn't the best solution in this case.
On a second note, is a parameter like str absolutely necessary? Would the start and end parameters not be sufficient, since no data prior to start is needed?
Simple typo.
printf("%.*s\n", (int)(p - input), token);
Should be
printf("%.*s\n", (int)(p - token), token);
str is an input argument, and start and end are output arguments. You could make start an inout argument, but not everyone likes those.
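For what it's worth, here is a rough sketch of what an inout version could look like; the name lex_inout is made up for illustration, and the caller's cursor doubles as the start pointer:

/* Hypothetical inout variant: on return, *str points at the start of the
   token and *end one past its end, so the separate start parameter goes away. */
int lex_inout(const char **str, const char **end)
{
    const char *ws = " \t\r\n";
    const char *delim = "() \t\r\n";
    const char *prefix = "()'`";

    *str += strspn(*str, ws);               /* skip leading whitespace */
    if (**str == '\0') {
        *str = *end = NULL;
        return 1;                           /* no more tokens */
    }
    if (strchr(prefix, **str) != NULL)
        *end = *str + 1;                    /* single-character token */
    else
        *end = *str + strcspn(*str, delim);
    return 0;
}

The caller then advances the cursor itself:

const char *p = input, *end;
while (lex_inout(&p, &end) == 0)
{
    printf("%.*s\n", (int)(end - p), p);
    p = end;                                /* step past the token just printed */
}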
In any case, the returned token starts at start and its length is end - start, which is why the printf length argument needs to be p - token.
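Putting it all together, the corrected usage looks something like this (the surrounding main is just my reconstruction); for the sample input it should print one token per line: (, foo, bar, 17, ', (, a, b, c, ), 2, ):

#include <stdio.h>
#include <string.h>

int lex(const char *str, const char **start, const char **end);  /* as defined in the question */

int main(void)
{
    const char *input = "(foo bar 17 '(a b c) 2)";
    const char *token;
    const char *p = input;

    lex(p, &token, &p);
    while (token != NULL)
    {
        /* p - token is the token's length, since p now points one past its end */
        printf("%.*s\n", (int)(p - token), token);
        lex(p, &token, &p);
    }
    return 0;
}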