Search code examples
prologdcg

Prolog: simple lexer/2


I need a small lexer/2 in prolog, currently I have

tokens(Z) --> "while", tokens(Y), {Z = [ttwhile | Y]}.
tokens(Z) --> "do", tokens(Y), {Z = [ttdo | Y]}.
tokens(Z) --> "endwhile", tokens(Y), {Z = [ttendwhile | Y]}.
tokens(Z) --> "repeat", tokens(Y), {Z = [ttrepeat | Y]}.
tokens(Z) --> "until", tokens(Y), {Z = [ttuntil | Y]}.
tokens(Z) --> "endrepeat", tokens(Y), {Z = [ttendrepeat | Y]}.
tokens(Z) --> "if", tokens(Y), {Z = [ttif | Y]}.
tokens(Z) --> "then", tokens(Y), {Z = [ttthen | Y]}.
tokens(Z) --> "else", tokens(Y), {Z = [ttelse | Y]}.
tokens(Z) --> "endif", tokens(Y), {Z = [ttendif | Y]}.
tokens(Z) --> "exit", tokens(Y), {Z = [ttexit | Y]}.
tokens(Z) --> "other", tokens(Y), {Z = [ttother | Y]}.

% Comparison operators.
tokens(Z) --> "==", tokens(Y), {Z = [equal | Y]}.
tokens(Z) --> "<>", tokens(Y), {Z = [notequal | Y]}.

% Assignment operator.
tokens(Z) --> ":=", tokens(Y), {Z = [:= | Y]}.  

% Boolean constants and operators.
tokens(Z) --> "true", tokens(Y), {Z = [true | Y]}.  
tokens(Z) --> "false", tokens(Y), {Z = [false | Y]}.  
tokens(Z) --> "and", tokens(Y), {Z = [and | Y]}.  
tokens(Z) --> "or", tokens(Y), {Z = [or | Y]}.  

tokens(Z) --> " ", tokens(Y), {Z = Y}.
tokens(Z) --> " ", tokens(Y), {Z = Y}.

tokens(Z) --> [C], tokens(Y), {name(X, [C]), Z = [X | Y]}.
tokens(Z) --> [], {Z = []}.

Can anyone help me with the next step for lexer/2 so that when I call lexer([while,a,==,b,do,abc,endwhile], R), I could get R = [ttwhile, a, equal, b, ttdo, abc, ttendwhile]?

Thank you very much.


Solution

  • well, this 'glue' - more or less - solves your request:

    lexer(L, Tokens) :-
        atomic_list_concat(L, ' ', A),
        atom_codes(A, Cs),
        phrase(tokens(Tokens), Cs).
    
    ?- lexer([while,a,==,b,do,abc,endwhile], R).
    R = [ttwhile, a, equal, b, ttdo, a, b, c, ttendwhile] ;
    R = [ttwhile, a, equal, b, ttdo, a, b, c, e|...] ;
    

    but you should really rewrite in declarative style:

    token(ttwhile) --> "while".
    token(ttendwhile) --> "endwhile".
    token(ttdo) --> "do".
    %...
    token(equal) --> "==".
    token(notequal) --> "<>".
    token(assign) --> ":=". 
    
    % this is wrong: symbols overlap with alphabetic tokens
    token(N) --> [C], {atom_codes(N,[C])}.
    
    tokens([]) --> [].
    tokens(Ts) --> " ", tokens(Ts).
    tokens([T|Ts]) --> token(T), tokens(Ts).
    
    lexer(Cs, Tokens) :-
        phrase(tokens(Tokens), Cs).
    

    and call passing a codes list, a double quoted (or backquoted, if you're using SWI) string

    ?- lexer(`while abc endwhile`, R).
    R = [ttwhile, a, b, c, ttendwhile] ;
    R = [ttwhile, a, b, c, e, n, d, ttwhile] ;
    ...
    

    edit

    to tokenize names (well, only lowercase, for simplicity), replace the above token(N) --> [C], {atom_codes(N,[C])}. with

    token(N) --> lower_case_chars(Cs), {Cs \= [], atom_codes(N,Cs)}.
    
    lower_case_chars([C|Cs]) --> lower_case_char(C), lower_case_chars(Cs).
    lower_case_chars([]) --> [].
    
    lower_case_char(C) --> [C], {C>=0'a, C=<0'z}.
    

    but it becomes a little verbose, when you add also upper_case_chars, digits, etc... it's worth to generalize, passing the characters range boundary, or use code_type/2:

    token(N) --> csymf(C), csyms(Cs), {atom_codes(N,[C|Cs])}.
    
    csymf(C) --> [C], {code_type(C,csymf)}.
    
    csyms([C|Cs]) --> [C], {code_type(C,csym)}, csyms(Cs).
    csyms([]) --> [].