Let's say I want to tokenize a string of words (symbols) and numbers separated by whitespaces. For example, the expected result of tokenizing "aa 11"
would be [tkSym("aa"), tkNum(11)]
.
My first attempt was the code below:
whitespace --> [Ws], { code_type(Ws, space) }, whitespace.
whitespace --> [].
letter(Let) --> [Let], { code_type(Let, alpha) }.
symbol([Sym|T]) --> letter(Sym), symbol(T).
symbol([Sym]) --> letter(Sym).
digit(Dg) --> [Dg], { code_type(Dg, digit) }.
digits([Dg|Dgs]) --> digit(Dg), digits(Dgs).
digits([Dg]) --> digit(Dg).
token(tkSym(Token)) --> symbol(Token).
token(tkNum(Token)) --> digits(Digits), { number_chars(Token, Digits) }.
tokenize([Token|Tokens]) --> whitespace, token(Token), tokenize(Tokens).
tokenize([]) --> whitespace, [].
Calling tokenize
on "aa bb"
leaves me with several possible responses:
?- tokenize(X, "aa bb", []).
X = [tkSym([97|97]), tkSym([98|98])] ;
X = [tkSym([97|97]), tkSym(98), tkSym(98)] ;
X = [tkSym(97), tkSym(97), tkSym([98|98])] ;
X = [tkSym(97), tkSym(97), tkSym(98), tkSym(98)] ;
false.
In this case, however, it seems appropriate to expect only one correct answer. Here's another, more deterministic approach:
whitespace --> [Space], { char_type(Space, space) }, whitespace.
whitespace --> [].
symbol([Sym|T]) --> letter(Sym), !, symbol(T).
symbol([]) --> [].
letter(Let) --> [Let], { code_type(Let, alpha) }.
% similarly for numbers
token(tkSym(Token)) --> symbol(Token).
tokenize([Token|Tokens]) --> whitespace, token(Token), !, tokenize(Tokens).
tokenize([]) --> whiteSpace, [].
But there is a problem: although the single answer to token
called on "aa"
is now a nice list, the tokenize
predicate ends up in an infinite recursion:
?- token(X, "aa", []).
X = tkSym([97, 97]).
?- tokenize(X, "aa", []).
ERROR: Out of global stack
What am I missing? How is the problem usually solved in Prolog?
The underlying problem is that in your second version, token//1
also succeeds for the "empty" token:
?- phrase(token(T), "").
T = tkSym([]).
Therefore, unintentionally, the following succeeds too, as does an arbitrary number of tokens:
?- phrase((token(T1),token(T2)), "").
T1 = T2, T2 = tkSym([]).
To fix this, I recommend you adjust the definitions so that a token must consist of at least one lexical element, as is also typical. A good way to ensure that at least one element is described is to split the DCG rules into two sets. For example, shown for symbol///1
:
symbol([L|Ls]) --> letter(L), symbol_r(Ls).
symbol_r([L|Ls]) --> letter(L), symbol_r(Ls).
symbol_r([]) --> [].
This way, you avoid an unbounded recursion that can endlessly consume empty tokens.
Other points:
Always use phrase/2
to access DCGs in a portable way, i.e., independent of the actual implementation method used by any particular Prolog system.
The []
in the final DCG clause is superfluous, you can simply remove it.
Also, avoid using so many !/0
. It is OK to commit to the first matching tokenization, but do it only at a single place, like via a once/1
wrapped around the phrase/2
call.
For naming, see my comment above. I recommend to use tokens//1
to make this more declarative. Sample queries, using the above definition of symbol//1
:
?- phrase(tokens(Ts), "").
Ts = [].
?- phrase(tokens(Ls), "a").
Ls = [tkSym([97])].
?- phrase(tokens(Ls), "a b").
Ls = [tkSym([97]), tkSym([98])].