Search code examples
prologswi-prologdcgfasta

What is the general pattern for creating a dcg for file input?


I always seem to struggle to write DCG's to parse input files. But it seems it should be simple? Are there any tips or tricks to think about this problem?

For a concrete example, lets say I want to parse a fasta file. (https://en.wikipedia.org/wiki/FASTA_format). I want to read each description and each sequence on back tracking.

:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
:- portray_text(true).
:- set_prolog_flag(double_quotes, codes).
:- set_prolog_flag(back_quotes,string).

fasta_file([]) -->[].
fasta_file([Section|Sections]) -->
   fasta_section(Section),
   fasta_file(Sections).


fasta_section(Section) -->
    fasta_description(Description),
    fasta_seq(Sequence),
    {Section =.. [section,Description,Sequence]}.

fasta_description(Description) -->
    ">",
    string(Description),
    {no_gt(Description),
     no_nl(Description)}.


fasta_seq([]) --> [].
fasta_seq(Seq) -->
    nt([S]),
    fasta_seq(Ss),
    {S="X"->Seq =Ss;Seq=[S|Ss]}.

 nt("A") --> "A".
 nt("C") --> "C".
 nt("G") --> "G".
 nt("T") --> "T".
 nt("X") --> "\n".

 no_gt([]).
 no_gt([E|Es]):-
     dif([E],">"),
     no_gt(Es).

 no_nl([]).
 no_nl([E|Es]):-
     dif([E],"\n"),
     no_nl(Es).

Now this is clearly wrong. The behaviour I would like is

 ?-phrase(fasta_section(S),">frog\nACGGGGTACG\n>duck\nACGTTAG").
 S = section("frog","ACGGGGTACG");
 S = section("duck","ACGTTAG");
 false.

But if I did phrase(fasta_file(Sections),">frog\nACGGGGTACG\n>duck\nACGTTAG). Sections is unified with a list of sections/2s, which is what I want, but my current code seems quite hacky- how I have handled the newline character for example.


Solution

  • for sure, there are 'small' typing problems:

    nt("A") -->"A",
    nt("C") -->"C",
    nt("G") -->"G",
    nt("T") -->"T". 
    

    should be

    nt("A") -->"A".
    nt("C") -->"C".
    nt("G") -->"G".
    nt("T") -->"T". 
    

    anyway, I also had my problems debugging DCG, I wrote a parser to load in Prolog a MySQL dump (plain SQL, really), and was a pain when something unexpected, like escaped strings, or UTF8 (?) weird encodings were found.

    I would suggest to use phrase/3, to see if there is an unparsable tail. Also, could help to place some debug output after known, well behaved sequences.

    Of course, I assume you already tried to use the SWI-Prolog debugger.

    Also, beware of

    ...
    dif([E],">"),
    ...
    

    did you set the appropriate flag about double quotes ? In DCG bodies, the rewrite machinery takes care of matching, but a sequence of codes in SWI-Prolog by default doesn't match double quoted strings...

    edit

    I think this will not solve your doubt about a general strategy... anyway, it's how I would handle the problem...

    fasta_file([]) -->[].
    fasta_file([Section|Sections]) -->
        fasta_section(Section),
        fasta_file(Sections).
    
    fasta_section(section(Description,Sequence)) -->
        fasta_description(Description),
        fasta_seq(SequenceCs), {atom_codes(Sequence, SequenceCs)}, !.
    
    fasta_description(Description) -->
        ">", string(DescriptionCs), "\n", {atom_codes(Description, DescriptionCs)}.
    
    fasta_seq([S|Seq]) --> nt(S), fasta_seq(Seq).
    fasta_seq([]) --> "\n" ; []. % optional \n at EOF
    
    nt(0'A) --> "A".
    nt(0'C) --> "C".
    nt(0'G) --> "G".
    nt(0'T) --> "T".
    

    now

    ?- phrase(fasta_file(S), `>frog\nACGGGGTACG\n>duck\nACGTTAG`).
    S = [section(frog, 'ACGGGGTACG'), section(duck, 'ACGTTAG')] ;
    false.
    

    note: the order of clauses fasta_seq//1 is important, since it implements an 'eager' parsing - mainly for efficiency. As I said, I had to parse SQL, several MBs was common.

    edit

    ?- phrase((string(_),fasta_section(S)), `>frog\nACGGGGTACG\n>duck\nACGTTAG`,_).
    S = section(frog, 'ACGGGGTACG') ;
    S = section(duck, 'ACGTTAG') ;
    false.
    

    fasta_section//1 is mean to match a definite sequence. To get all on backtracking we must provide a backtrack point. In this case, string//1 from library(dcg/basics) does the job