
Why does my Prolog S-expression tokenizer fail on its base case?


To learn some Prolog (I'm using GNU Prolog) and grok its parsing abilities, I am starting by writing a Lisp (or S-expression, to be exact) tokenizer, which, given a list of characters like ['(', 'f', 'o', 'o', ')'], should produce ['(', 'foo', ')']. It's not working as expected, which is why I'm here! I think my thought process shows through in my pseudocode:

tokenize([current | rest], buffer, tokens):
    if current is '(' or ')',
        Tokenize the rest,
        And the output will be the current token buffer,
        Plus the parenthesis and the rest.

    if current is ' ',
        Tokenize the rest with a clean buffer,
        And the output will be the buffer plus the rest.
    
    if the tail is empty,
        The output will be a one-element list containing the buffer.
    
    otherwise,
        Add the current character to the buffer,
        And the output will be the rest tokenized, with a bigger buffer.

I translated that to Prolog like this:

tokenize([Char | Chars], Buffer, Tokens) :-
    ((Char = '(' ; Char = ')') ->
        tokenize(Chars, '', Tail_Tokens),
        Tokens is [Buffer, Char | Tail_Tokens];
    Char = ' ' ->
        tokenize(Chars, '', Tail_Tokens),
        Tokens is [Buffer | Tail_Tokens];

    Chars = [] -> Tokens is [Buffer];

    atom_concat(Buffer, Char, New_Buffer),
    tokenize(Chars, New_Buffer, Tokens)).

print_tokens([]) :- write('.').
print_tokens([T | N]) :- write(T), write(', '), print_tokens(N).

main :-
    % tokenize(['(', 'f', 'o', 'o', '(', 'b', 'a', 'r', ')', 'b', 'a', 'z', ')'], '', Tokens),
    tokenize(['(', 'f', 'o', 'o', ')'], '', Tokens),
    print_tokens(Tokens).

When I run it like this: gprolog --consult-file lisp_parser.pl and then query main, it just tells me no. I traced main, and it gave me the stack trace below. I do not understand why tokenize fails for the empty case. I see that the buffer is empty since it was cleared by the previous ')', but even if Tokens is empty at that point in time, wouldn't Tokens accumulate a larger result recursively? Can someone who is good with Prolog give me a few tips here?

| ?- main.

no
| ?- trace.
The debugger will first creep -- showing everything (trace)

(1 ms) yes
{trace}
| ?- main.
      1    1  Call: main ? 
      2    2  Call: tokenize(['(',f,o,o,')'],'',_353) ? 
      3    3  Call: tokenize([f,o,o,')'],'',_378) ? 
      4    4  Call: atom_concat('',f,_403) ? 
      4    4  Exit: atom_concat('',f,f) ? 
      5    4  Call: tokenize([o,o,')'],f,_429) ? 
      6    5  Call: atom_concat(f,o,_454) ? 
      6    5  Exit: atom_concat(f,o,fo) ? 
      7    5  Call: tokenize([o,')'],fo,_480) ? 
      8    6  Call: atom_concat(fo,o,_505) ? 
      8    6  Exit: atom_concat(fo,o,foo) ? 
      9    6  Call: tokenize([')'],foo,_531) ? 
     10    7  Call: tokenize([],'',_556) ? 
     10    7  Fail: tokenize([],'',_544) ? 
      9    6  Fail: tokenize([')'],foo,_519) ? 
      7    5  Fail: tokenize([o,')'],fo,_468) ? 
      5    4  Fail: tokenize([o,o,')'],f,_417) ? 
      3    3  Fail: tokenize([f,o,o,')'],'',_366) ? 
      2    2  Fail: tokenize(['(',f,o,o,')'],'',_341) ? 
      1    1  Fail: main ? 

(1 ms) no
{trace}
| ?- 

Solution

  • How about this. First, why your version fails: the clause head tokenize([Char | Chars], ...) can never match an empty list, so when the recursion reaches tokenize([], '', _) no clause applies and the whole call chain fails, exactly as your trace shows. (Separately, is/2 is arithmetic evaluation, not unification, so Tokens is [Buffer | Tail_Tokens] will not build a list; use = there.) I think the following is what you want to do, but let's use Definite Clause Grammars (which are just Horn clauses with :- replaced by --> and two elided arguments holding the input character list and the remaining character list). An example DCG rule:

    rule(X) --> [c], another_rule(X), {predicate(X)}.
    

    List processing rule rule//1 says: when you find the character c in the input list, continue list processing with another_rule//1, and when that has worked out, call predicate(X) as an ordinary goal.
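
    Under the hood, each DCG rule is compiled into an ordinary clause with those two list arguments made explicit. Here is a rough sketch of what the rule above expands to (the variable names S0, S1, S2 are illustrative; the actual translation is compiler-generated and may differ in detail):

    rule(X, S0, S) :-
        S0 = [c | S1],             % consume the character c
        another_rule(X, S1, S2),   % continue parsing the rest of the list
        predicate(X),              % the goal in { ... }, called as-is
        S = S2.                    % hand back whatever input remains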

    Then:

    % If we encounter a separator symbol '(' or ')', we commit to this
    % clause using '!' (there is no point trying anything else, in
    % particular not the clause for "other characters"), tokenize the
    % rest of the list, and once that is done decide whether 'MaybeToken',
    % which is "the part of the leftmost token after '(' or ')'", should
    % be retained: it is dropped if it is empty. The caller is then given
    % an empty "part of the leftmost token" and the list of tokens, with
    % '(' or ')' prepended: "tokenize('', [ '(' | MoreTokens] )  -->"
     
    tokenize('', [ '(' | MoreTokens] ) -->
       ['('],
       !,
       tokenize(MaybeToken,Tokens),
       {drop_empty(MaybeToken,Tokens,MoreTokens)}.
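
    % For example, given the input ['(',f,o,o], this clause consumes '(',
    % the recursion yields MaybeToken = foo with Tokens = [], and
    % drop_empty/3 keeps foo, so the clause succeeds with
    % tokenize('', ['(',foo]).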
       
    tokenize('',[')'|MoreTokens]) --> 
       [')'],
       !,
       tokenize(MaybeToken,Tokens),
       {drop_empty(MaybeToken,Tokens,MoreTokens)}.
       
    % No more characters in the input list (that's what '--> []' says).
    % We succeed, with an empty token list and an empty buffer for the
    % leftmost token.
    
    tokenize('',[]) --> [].
    
    % If we find a 'Ch' that is not '(' or ')', then tokenize
    % more of the list via 'tokenize(MaybeToken,Tokens)'. On
    % return, 'MaybeToken' is a piece of the leftmost token found
    % in that list, so we have to stick 'Ch' onto its start.
    
    tokenize(LargerMaybeToken,Tokens) --> 
       [Ch],
       tokenize(MaybeToken,Tokens),
       {atom_concat(Ch,MaybeToken,LargerMaybeToken)}.
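
    % For example, with input [f,o,o] and nothing after it, this clause
    % peels off f, the recursion returns MaybeToken = oo with Tokens = [],
    % and atom_concat(f, oo, foo) rebuilds the complete token.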
    
    % ---
    % This drops an empty "MaybeToken". If "MaybeToken" is 
    % *not* empty, it is actually a token and prepended to the list "Tokens"
    % ---
    
    drop_empty('',Tokens,Tokens) :- !.
    drop_empty(MaybeToken,Tokens,[MaybeToken|Tokens]).
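
    % Example queries:
    %
    % ?- drop_empty('', [foo,bar], R).
    % R = [foo,bar]
    %
    % ?- drop_empty(baz, [foo,bar], R).
    % R = [baz,foo,bar]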
    
    % -----------------
    % Call the DCG using phrase/2
    % -----------------
    
    tokenize(Text,Result) :-
       phrase( tokenize(MaybeToken,Tokens), Text ),
       drop_empty(MaybeToken,Tokens,Result),!.
    

    And so:

    ?- tokenize([h,e,l,l,o],R).
    R = [hello].
    
    ?- tokenize([h,e,l,'(',l,')',o],R).
    R = [hel,'(',l,')',o].
    
    ?- tokenize([h,e,l,'(',l,l,')',o],R).
    R = [hel,'(',ll,')',o].
    

    I think that in GNU Prolog, the back-quoted notation `hello` generates the list [h,e,l,l,o] directly (this depends on the back_quotes Prolog flag).
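
    Typing character lists by hand gets tedious; one convenient alternative is to build them from an atom with the standard atom_chars/2 (available in GNU Prolog and SWI-Prolog alike). Using the nested example from the commented-out test case in the question, this should give:

    ?- atom_chars('(foo(bar)baz)', Cs), tokenize(Cs, R).
    Cs = ['(',f,o,o,'(',b,a,r,')',b,a,z,')']
    R = ['(',foo,'(',bar,')',baz,')']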