Search code examples
cbisonyacclex

How to find a substring


Given a text file containing a string, I would find some specific substrings/sequences inside this string.

Bison file .y (Declaration+Rules)

%token <cval> AMINO 
%token STARTCODON STOPCODON
%type <cval> seq

%%
series: STARTCODON seq STOPCODON {printf("%s", $2);}
seq: AMINO
         | seq AMINO
         ;
%%

Here I want to print every sequence between STARTCODON and STOPCODON

Flex file .l (Rules)

    %%
    ("ATG")+ {return STARTCODON;}
    ("TAA"|"TAG"|"TGA")+ {return STOPCODON;}

    ("GCT"|"GCC"|"GCA"|"GCG")+ {yylval.cval = 'A';
                           return AMINO;}
    ("CGT"|"CGC"|"CGA"|"CGG"|"AGA"|"AGG")+ {yylval.cval = 'R';
                           return AMINO;}
     .
     .
     .

     [ \t]+                  /*ignore whitespace*/
     \n                      /*ignore end of line*/
     . {printf("-");}
     %%

When I run the code I get only the output of the rule . {printf("-");}.

I am new Flex/Bison, I suspect that:

  • The bison rule series: STARTCODON seq STOPCODON {printf("%s", $2);} is not correct.

  • Flex doesn't subdivide correctly the entire string into tokens of 3 characters.

EDIT:

(Example) Input file: DnaSequence.txt:

Input string:cccATGAATTATTAGzzz, where lower characters (ccc, zzz) produce the (right) output -, ATG is the STARTCODON, AATTAT is the sequence of two AMINO (AAT TAT), and TAG is the STOPCODON.

This input string produces the (wrong) output ---.

EDIT:

Following the suggestions of @JohnBollinger I have added <<EOF>> {return ENDTXT;} in the Flex file, and the rule finalseries: series ENDTXT; in the Bison file.

Now it's returning the yyerror's error message, indicating a parsing error. I suppose that we need a STARTTXT token, but I don't know how to implement it.


Solution

  • I am new Flex/Bison, I suspect that:

    • The bison rule series: STARTCODON seq STOPCODON {printf("%s", $2);} is not correct.

    The rule is syntactically acceptable. It would be semantically correct if the value of token 2 were a C string, in which case it would cause that value to be printed to the standard output, but your Flex file appears to assume that type <cval> is char, which is not a C string, nor directly convertible to one.

    • Flex doesn't subdivide correctly the entire string into tokens of 3 characters.

    Your Flex input looks OK to me, actually. And the example input / output you present indicates that Flex is indeed recognizing all your triplets from ATG to TAG, else the rule for . would be triggered more than three times.


    The datatype problem is a detail that you'll need to sort out, but the main problem is that your production for seq does not set its semantic value. How that results in (seemingly) nothing being printed when the series production is used for a reduction depends on details that you have not disclosed, and probably involves undefined behavior.

    If <cval> were declared as a string (char *), and if your lexer set its values as strings rather than as characters, then setting the semantic value might look something like this:

    seq: AMINO { $$ = calloc(MAX_AMINO_ACIDS + 1, 1); /* check for allocation failure ... */
                 strcpy($$, $1); }
             | seq AMINO { $$ = $1; strcat($$, $2); }
             ;
    

    You might consider sticking with char as the type for the semantic value of AMINO, and defining seq to have a different type (i.e. char *). That way, your changes could be restricted to the grammar file. That would, however, call for a different implementation of the semantic actions in the production for seq.


    Finally, note that although you say

    Here I want to print every sequence between STARTCODON and STOPCODON

    your grammar, as presented, has series as its start symbol. Thus, once it reduces the token sequence to a series, it expects to be done. If additional tokens follow (say those of another series) then that would be erroneous. If that's something you need to support then you'll need a higher-level start symbol representing a sequence of multiple series.