Search code examples
prologdcg

Tokenizing a string in Prolog using DCG


Let's say I want to tokenize a string of words (symbols) and numbers separated by whitespaces. For example, the expected result of tokenizing "aa 11" would be [tkSym("aa"), tkNum(11)].

My first attempt was the code below:

whitespace --> [Ws], { code_type(Ws, space) }, whitespace.
whitespace --> [].

letter(Let)     --> [Let], { code_type(Let, alpha) }.
symbol([Sym|T]) --> letter(Sym), symbol(T).
symbol([Sym])   --> letter(Sym).

digit(Dg)        --> [Dg], { code_type(Dg, digit) }.
digits([Dg|Dgs]) --> digit(Dg), digits(Dgs).
digits([Dg])     --> digit(Dg).

token(tkSym(Token)) --> symbol(Token). 
token(tkNum(Token)) --> digits(Digits), { number_chars(Token, Digits) }.

tokenize([Token|Tokens]) --> whitespace, token(Token), tokenize(Tokens).
tokenize([]) --> whitespace, [].  

Calling tokenize on "aa bb" leaves me with several possible responses:

 ?- tokenize(X, "aa bb", []).
 X = [tkSym([97|97]), tkSym([98|98])] ;
 X = [tkSym([97|97]), tkSym(98), tkSym(98)] ;
 X = [tkSym(97), tkSym(97), tkSym([98|98])] ;
 X = [tkSym(97), tkSym(97), tkSym(98), tkSym(98)] ;
 false.

In this case, however, it seems appropriate to expect only one correct answer. Here's another, more deterministic approach:

whitespace --> [Space], { char_type(Space, space) }, whitespace.
whitespace --> [].

symbol([Sym|T]) --> letter(Sym), !, symbol(T).
symbol([])      --> [].
letter(Let)     --> [Let], { code_type(Let, alpha) }.

% similarly for numbers

token(tkSym(Token)) --> symbol(Token).

tokenize([Token|Tokens]) --> whitespace, token(Token), !, tokenize(Tokens).
tokenize([]) --> whiteSpace, [].

But there is a problem: although the single answer to token called on "aa" is now a nice list, the tokenize predicate ends up in an infinite recursion:

 ?- token(X, "aa", []).
 X = tkSym([97, 97]).

 ?- tokenize(X, "aa", []).
 ERROR: Out of global stack

What am I missing? How is the problem usually solved in Prolog?


Solution

  • The underlying problem is that in your second version, token//1 also succeeds for the "empty" token:

    ?- phrase(token(T), "").
    T = tkSym([]).
    

    Therefore, unintentionally, the following succeeds too, as does an arbitrary number of tokens:

    ?- phrase((token(T1),token(T2)), "").
    T1 = T2, T2 = tkSym([]).
    

    To fix this, I recommend you adjust the definitions so that a token must consist of at least one lexical element, as is also typical. A good way to ensure that at least one element is described is to split the DCG rules into two sets. For example, shown for symbol///1:

    symbol([L|Ls]) --> letter(L), symbol_r(Ls).
    
    symbol_r([L|Ls]) --> letter(L), symbol_r(Ls).
    symbol_r([])     --> []. 
    

    This way, you avoid an unbounded recursion that can endlessly consume empty tokens.

    Other points:

    Always use phrase/2 to access DCGs in a portable way, i.e., independent of the actual implementation method used by any particular Prolog system.

    The [] in the final DCG clause is superfluous, you can simply remove it.

    Also, avoid using so many !/0. It is OK to commit to the first matching tokenization, but do it only at a single place, like via a once/1 wrapped around the phrase/2 call.

    For naming, see my comment above. I recommend to use tokens//1 to make this more declarative. Sample queries, using the above definition of symbol//1:

    ?- phrase(tokens(Ts), "").
    Ts = [].
    
    ?- phrase(tokens(Ls), "a").
    Ls = [tkSym([97])].
    
    ?- phrase(tokens(Ls), "a b").
    Ls = [tkSym([97]), tkSym([98])].