
Prolog - parse with DCG


I want to write a DCG that can handle text as it appears in Notepad. I've read some online tutorials about writing DCGs, but none of them dealt with free-form text that involves strings, dates and integers. I'm not sure how to even start writing the DCG (how to represent a line, or even a date). Any help?


Solution

  • The 'trick' is to approach the problem declaratively, listing the more selective patterns first. Your data format appears to have a well-defined columnar structure, which can be handled with library(dcg/basics):

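    :- use_module(library(dcg/basics)).
    
    % One data row: a date, a key, then any number of "Num Perc%" pairs.
    row([Date,Key|Numerics]) -->
     date(Date), separe, key(Key), separe, numeric_pairs(Numerics).
    
    % The sample data uses US-style dates: month/day/year.
    date(M/D/Y) -->
     integer(M), "/", integer(D), "/", integer(Y).
    
    % Key: a non-space code, then as few codes as possible (string//1 is lazy).
    key([F|Ks]) -->
     [F], {F \= 0' }, string(Ks).
    
    numeric_pairs([Num:Perc|NPs]) -->
     integer(Num), separe, number(Perc), "%", whites, !, numeric_pairs(NPs).
    numeric_pairs([]) --> [].
    
    % Column separator: at least one white-space character.
    separe --> white, whites.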
    

    test:

    ?- atom_codes('02/18/2014  BATS Z  235122734   6.90%   109183482   10.50%  147587409   7.80%', Cs), phrase(row(R), Cs).
    Cs = [48, 50, 47, 49, 56, 47, 50, 48, 49|...],
    R = [2/18/2014, [66, 65, 84, 83, 32, 90], 235122734:6.9, 109183482:10.5, 147587409:7.8] 
    

    I must say that this isn't very easy to debug. When Prolog backtracks, you get no hint about what went wrong... There should be a specialized trace, I guess...
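
    One way to narrow down a failure, short of a specialized trace, is to run each nonterminal on its own with phrase/3 and look at what it leaves unconsumed. The sketch below is only an illustration: parse_prefix/3 is a hypothetical helper name, not part of any library.

    ```prolog
    :- use_module(library(dcg/basics)).

    % parse_prefix(+NT, +Atom, -Rest)  (hypothetical helper)
    % Run nonterminal NT on the codes of Atom; Rest is the unconsumed
    % remainder, converted back to an atom so it is easy to read.
    parse_prefix(NT, Atom, Rest) :-
        atom_codes(Atom, Codes),
        phrase(NT, Codes, RestCodes),
        atom_codes(Rest, RestCodes).
    ```

    With the grammar above loaded, a query such as `?- parse_prefix(date(D), '02/18/2014  BATS Z', Rest).` shows whether date//1 succeeds and where it stops; repeating this for key//1, separe//0 and numeric_pairs//1 usually points at the nonterminal that misbehaves.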

    To feed the DCG, see library(pure_input), or - easier to debug - fetch a line at a time with read_line_to_codes/2.
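
    The line-at-a-time approach could look roughly like this. It is a minimal sketch: parse_file/3 and read_rows/3 are hypothetical names, and the nonterminal is passed in by name so that the block does not depend on a particular grammar.

    ```prolog
    :- use_module(library(readutil)).

    % parse_file(+File, +NT, -Rows)  (hypothetical helper)
    % Rows holds one term per line of File that the one-argument
    % nonterminal NT accepts; lines that do not parse are skipped.
    parse_file(File, NT, Rows) :-
        setup_call_cleanup(
            open(File, read, In),
            read_rows(In, NT, Rows),
            close(In)).

    read_rows(In, NT, Rows) :-
        read_line_to_codes(In, Codes),
        (   Codes == end_of_file
        ->  Rows = []
        ;   phrase(call(NT, Row), Codes)      % same as phrase(NT(Row), Codes)
        ->  Rows = [Row|Rest],
            read_rows(In, NT, Rest)
        ;   read_rows(In, NT, Rows)           % header, blank line, etc.: skip
        ).
    ```

    With the single-line row//1 grammar loaded, `?- parse_file('/tmp/test.txt', row, Rows).` would collect the parsed rows. Because each line is parsed separately, a line that fails can be inspected in isolation.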

    Edit: maybe my hint to use read_line_to_codes/2 was a bad one.

    Here is a complete scan of your test data, using phrase_from_file/2, followed by selection of the requested column and a sum over it (the column number is passed as an argument).

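    :- use_module(library(dcg/basics)).
    :- use_module(library(pure_input)).
    :- use_module(library(lists)).   % nth1/3, sum_list/2
    :- use_module(library(apply)).   % maplist/3
    
    % test(+ColToSum, -Tot): sum the Num part of column ColToSum over all rows.
    % Columns count from 1: date, key, then the Num:Perc pairs.
    test(ColToSum, Tot) :-
        phrase_from_file(row(Rows), '/tmp/test.txt'),
        maplist(get_col(ColToSum), Rows, Cols),
        sum_list(Cols, Tot).
    
    get_col(ColToSum, Row, Col) :-
        nth1(ColToSum, Row, Col:_).
    
    % A line that matches the data pattern is collected...
    row([[Date,Key|Numerics]|Rows]) -->
     date(Date), separe, key(Key), separe, numeric_pairs(Numerics), "\n",
     row(Rows).
    % ...any other line is skipped...
    row(Rows) -->
     string(_), "\n",
     row(Rows).
    % ...until the input is exhausted.
    row([]) --> [].
    
    % The sample data uses US-style dates: month/day/year.
    date(M/D/Y) -->
     integer(M), "/", integer(D), "/", integer(Y).
    
    key([F|Ks]) -->
     [F], {F \= 0' }, string(Ks).
    
    numeric_pairs([Num:Perc|NPs]) -->
     integer(Num), separe, number(Perc), "%", whites, !, numeric_pairs(NPs).
    numeric_pairs([]) --> [].
    
    separe --> white, whites.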
    

    that yields

    ?- test(3,X).
    X = 561877153 
    

    If you're using Windows, use "\r\n" as the line terminator...
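
    Instead of hard-coding one terminator, a small nonterminal can accept either form. This is only a sketch, and eol//0 is a hypothetical name:

    ```prolog
    % eol//0 (hypothetical helper): a Windows "\r\n" or a Unix "\n" line end.
    eol --> "\r\n", !.
    eol --> "\n".
    ```

    Replacing each "\n" in row//1 with eol should then let the same grammar read files with either kind of line ending.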

    HTH