
SWI-Prolog tokenize_atom/2 replacement?


What I need to do is break an atom into tokens. For example:

tokenize_string('Hello, World!', L).

would unify L = ['Hello', ',', 'World', '!'], exactly as tokenize_atom/2 does. But when I try to use tokenize_atom/2 with non-Latin letters, it fails. Is there a universal replacement, or how can I write one? Thanks in advance.


Solution

  • Well, you could write your own lexer. For example, here is the lexer from my arithmetic expression parser.

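    % Newer SWI-Prolog versions provide this library as library(dcg/basics);
    % older releases shipped it as library(http/dcg_basics).
    :- use_module(library(dcg/basics)).

    % This code was written for double-quoted text read as code lists.
    % SWI-Prolog 7 and later read it as strings by default, so ask for codes here.
    :- set_prolog_flag(double_quotes, codes).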
    
    %
    % lexer
    %
    
    lex([H | T]) -->
        lexem_t(H), !,
        lex(T).
    
    lex([]) -->
        [].
    
    lexem_t(L) --> trashes, lexem(L), trashes.
    
    trashes --> trash, !, trashes.
    trashes --> [].
    
    trash --> comment_marker(End), !, string(_), End.
    trash --> white.
    
    comment_marker("*)") --> "(*".
    comment_marker("*/") --> "/*".
    
    hex_start --> "0X".
    hex_start --> "0x".
    
    lexem(open) --> "(".
    lexem(close) --> ")".
    lexem(+) --> "+".
    lexem(-) --> "-".
    lexem(*) --> "*".
    lexem(/) --> "/".
    lexem(^) --> "^".
    lexem(',') --> ",".   % the comma atom must be quoted in standard syntax
    lexem(!) --> "!".
    
    lexem(N) --> hex_start, !, xinteger(N). % this handles hex numbers
    lexem(N) --> number(N). % this handles integers/floats
    lexem(var(A)) --> identifier_c(L), {string_to_atom(L, A)}.
    
    identifier_c([H | T]) --> alpha(H), !, many_alnum(T).
    
    alpha(H) --> [H], {code_type(H, alpha)}.
    alnum(H) --> [H], {code_type(H, alnum)}.
    
    many_alnum([H | T]) --> alnum(H), !, many_alnum(T).
    many_alnum([]) --> [].
    

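    How it works (on SWI-Prolog 7 or later, run set_prolog_flag(double_quotes, codes) at the toplevel first, because phrase/2 expects a code list rather than a string):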

     ?- phrase(lex(L), "abc 123 привет 123.4e5 !+- 0xabc,,,"), write(L).
    [var(abc), 123, var(привет), 1.234e+007, !, +, -, 2748, (,), (,), (,)]
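
The lexer above returns terms such as var(abc) and numbers, while the question asks for a flat list of atoms. A minimal sketch of how the same DCG idea could be adapted to that shape follows. The names tokenize_string/2, tokens//1, tokens_rest//1, token//1, alnums//1 and spaces//0 are hypothetical (only tokenize_string/2 appears in the question), and the sketch assumes that code_type/2 classifies non-Latin letters as alnum on your system, as the output above suggests it does for alpha.

```prolog
% tokenize_string(+Text, -Tokens)
% Text is an atom; Tokens is a list of atoms: words (runs of
% alphanumeric characters) and single punctuation characters.
tokenize_string(Text, Tokens) :-
    atom_codes(Text, Codes),
    phrase(tokens(Tokens), Codes).

% Skip leading white space, then read tokens until the input ends.
tokens(Tokens) --> spaces, tokens_rest(Tokens).

tokens_rest([T|Ts]) --> token(T), !, spaces, tokens_rest(Ts).
tokens_rest([])     --> [].

% A token is a maximal run of alphanumeric codes,
% or any single code that is not alphanumeric.
token(Word) -->
    [C], { code_type(C, alnum) }, !,
    alnums(Cs),
    { atom_codes(Word, [C|Cs]) }.
token(Punct) -->
    [C],
    { atom_codes(Punct, [C]) }.

alnums([C|Cs]) --> [C], { code_type(C, alnum) }, !, alnums(Cs).
alnums([])     --> [].

spaces --> [C], { code_type(C, space) }, !, spaces.
spaces --> [].
```

With this loaded, the query tokenize_string('Hello, World!', L) should bind L to the four atoms 'Hello', ',', 'World' and '!'. Unlike the lexer above, this sketch returns digit runs as atoms rather than numbers, and it does not skip comments.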