I'm trying for the moment to keep my lexer and parser separate, based on the vague advice of the book Prolog and Natural Language Analysis, which really doesn't go into any detail about lexing/tokenizing. So I am giving it a shot and seeing several little issues that indicate to me that there is something obvious I'm missing.
All my little token parsers seem to be working alright; at the moment this is a snippet of my code:
:- use_module(library(dcg/basics)).
operator('(') --> "(".
operator(')') --> ")".
operator('[') --> "[".
operator(']') --> "]".
% ... etc.
keyword(array) --> "array".
keyword(break) --> "break".
% ... etc.
It's a bit repetitive but it seems to work. Then I have some stuff I don't completely love and would welcome suggestions on, but does seem to work:
id(id(Id)) -->
    [C],
    { char_type(C, alpha) },
    idRest(Rest),
    { atom_chars(Id, [C|Rest]) }.

idRest([C|Rest]) -->
    [C],
    { char_type(C, alpha) ; char_type(C, digit) ; C = '_' },
    idRest(Rest).
idRest([]) --> [].
int(int(Int)) --> integer(Int).
string(str(String)) -->
    "\"",
    stringContent(Codes),
    "\"",
    { string_chars(String, Codes) }.
stringContent([C|Chars]) -->
stringChar(C), stringContent(Chars).
stringContent([]) --> [].
stringChar(0'\n) --> "\\n".
stringChar(0'\t) --> "\\t".
stringChar(0'\") --> "\\\"".
stringChar(0'\\) --> "\\\\".
stringChar(C) --> [C].
The main rule for my tokenizer is this:
token(X) --> whites, (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)).
It's not perfect; I will see int get parsed into in,id(t) because keyword(X) comes before id(X). So I guess that's the first question.
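(One common fix is to demand a token boundary after the keyword, so that "in" cannot match inside "int". A sketch, assuming code-list input as library(dcg/basics) expects; raw_keyword//1 and id_char//0 are made-up names:

```prolog
% Accept a keyword only when it is not immediately followed by
% another identifier character; \+//1 peeks without consuming input.
keyword(K) --> raw_keyword(K), \+ id_char.

id_char --> [C], { code_type(C, csym) }.

raw_keyword(array) --> "array".
raw_keyword(break) --> "break".
```

With that boundary check, keyword//1 fails on "int" and id//1 gets a chance to consume the whole identifier.)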
The bigger question I have is that I do not see how to properly integrate comments into this situation. I have tried the following:
skipAhead --> [].
skipAhead --> (comment ; whites), skipAhead.
comment --> "/*", anything, "*/".
anything --> [].
anything --> [_], anything.
token(X) --> skipAhead, (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)).
This does not seem to work; the parses that return (and I get many parses) do not seem to have the comment removed. I'm nervous that my comment rule is needlessly inefficient and probably induces a lot of unnecessary backtracking. I'm also nervous about whether whites//0 from dcg/basics is deterministic; however, that part of the equation seems to work; it's just integrating it with the comment skipping that doesn't.
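(For comparison, a cut-based variant of the skipping that commits after each step and leaves no choice points behind. This is only a sketch, assuming code-list input; an unterminated comment simply fails here rather than being reported:

```prolog
skipAhead --> "/*", !, commentRest, skipAhead.             % enter a comment
skipAhead --> [C], { code_type(C, space) }, !, skipAhead.  % skip whitespace
skipAhead --> [].                                          % nothing to skip

commentRest --> "*/", !.                                   % comment closed
commentRest --> [_], commentRest.                          % consume anything else
```

Because each clause cuts as soon as it has committed to a comment or a whitespace character, backtracking into skipAhead//0 cannot "un-skip" a comment later.)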
As a final note, I don't see how to handle propagating parse errors back to the user with line/column information from here. It feels like I'd have to track and thread through some kind of current line/column info and write it into the tokens and then maybe try to rebuild the line if I wanted to do something similar to how llvm does it. Is that fair or is there a "recommended practice" there?
The whole code can be found in this haste.
I have this code to support error reporting. It must itself be handled with care, sprinkling meaningful messages and 'skip rules' around the code. But there is no ready-to-use alternative: a DCG is a nice computation engine, but it cannot compete out-of-the-box with specialized parsing engines that are able to emit error messages automatically, exploiting the theoretical properties of the targeted grammars...
:- dynamic text_length/1.

parse_conf_cs(Cs, AST) :-
    length(Cs, TL),
    retractall(text_length(_)),
    assert(text_length(TL)),
    phrase(cfg(AST), Cs).
....
%% tag(?T, -X, -Y)// is det.
%
% Start/Stop tokens for XML like entries.
% Maybe this should restrict somewhat the allowed text.
%
tag(T, X, Y) -->
pos(X), unquoted(T), pos(Y).
....
%% pos(-C, +P, -P) is det.
%
% capture offset from end of stream
%
pos(C, P, P) :- text_length(L), length(P, Q), C is L - Q.
tag//3 is just an example usage; in this parser I'm building an editable AST, so I store the positions to be able to properly attribute each nested part in an editor...
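If line/column numbers are wanted in messages, the stored offset can be mapped back on demand instead of threading a counter through the whole grammar. A sketch (the helper names are mine):

```prolog
% Map a character offset (from the start of the stream) back to
% Line:Col by folding over the consumed prefix; re-scanning is
% cheap enough when it only happens on error paths.
offset_line_col(Codes, Offset, Line:Col) :-
    length(Prefix, Offset),
    append(Prefix, _, Codes),
    foldl(count_pos, Prefix, 1-1, Line-Col).

count_pos(0'\n, L-_, L1-1) :- !, L1 is L + 1.   % newline: next line, col 1
count_pos(_,    L-C, L-C1) :- C1 is C + 1.      % any other char: next col
```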
Edit: a small enhancement for id//1: SWI-Prolog has a specialized code_type/2 for that:
1 ?- code_type(0'a, csymf).
true.
2 ?- code_type(0'1, csymf).
false.
so (glossing over literal transformation)
id([C|Cs]) --> [C], {code_type(C, csymf)}, id_rest(Cs).
id_rest([C|Cs]) --> [C], {code_type(C, csym)}, id_rest(Cs).
id_rest([]) --> [].
Depending on your attitude toward generalizing small snippets, and on the actual grammar details, id_rest//1 could be written in a reusable fashion and made deterministic:
id([C|Cs]) --> [C], {code_type(C, csymf)}, codes(csym, Cs).
% greedy and deterministic
codes(Kind, [C|Cs]) --> [C], {code_type(C, Kind)}, !, codes(Kind, Cs).
codes(Kind, []), [C] --> [C], {\+code_type(C, Kind)}, !.
codes(_, []) --> [].
This stricter definition of id//1 would also remove some ambiguity w.r.t. identifiers with a keyword prefix; recoding keyword//1 like
keyword(K) --> id(id(K)), { memberchk(K, [
    array,
    break,
    ...
]) }.
will correctly identify
?- phrase(tokenize(Ts), `if1*2`).
Ts = [id(if1), *, int(2)] ;
Your string//1 (OT: what an unfortunate clash with library(dcg/basics):string//1) is an easy candidate for implementing a simple 'error recovery strategy':
stringChar(0'\\) --> "\\\\".
stringChar(0'") --> pos(X), "\n", {format('unclosed string at ~d~n', [X])}.
It's an example of 'report error and insert missing token', so the parsing can go on...