How to output a portion of the value of yytext?


I created a lexer for tokenizing XML documents. I show the lexer at the bottom of this message.

For this XML document:

<?xml version="1.0" encoding="UTF-8"?>
<Document version="1.0">
    <message>Hello, world</message>
</Document>

The lexer produces this output:

Start Tag = <Document
Attribute Name = version
Attribute Value = 1.0
Start Tag = <message
Element Value = Hello, world
End Tag = </message>
End Tag = </Document>

However, I don't want the output to be:

Start Tag = <Document

Instead, I want the output to be:

Start Tag = Document

That is, without the < symbol.

To implement that change, in the main() routine of my lexer I changed this:

printf("Start Tag = %s\n", yytext);

to this:

printf("Start Tag = %s\n", yytext[1]);

After that change, the lexer printed nothing at all. Why does such a small change produce no output, and what is the correct way to print yytext without its first character?

Here is my lexer:

%x ELEMENT_CONTENT
%x ATTRIBUTE
%x QUOTED_ATTRIBUTE_VALUE
%x APOSTROPHED_ATTRIBUTE_VALUE
%{
  enum yytokentype {
    START_TAG = 258,
    END_TAG = 259,
    ELEMENT_VALUE = 260,
    ATTRIBUTE_NAME = 261,
    ATTRIBUTE_VALUE = 262,
    JUNK = 263
  };
  int yyval;
%}
%%
<INITIAL>{
    "<?xml"[^?>]+"?>"[[:space:]]+  {}
    ">"                         {}
    "<"[^/>[:space:]]+          { BEGIN ATTRIBUTE; return(START_TAG); }
    "</"[^[:space:]]+           { return(END_TAG); }
    [[:space:]]+                {}
}
<ATTRIBUTE>{
    ">"                         { BEGIN ELEMENT_CONTENT; }
    "/>"                        { BEGIN INITIAL; }
    [[:space:]]+                {}
    [^[:space:]="'>/]+          { return(ATTRIBUTE_NAME); }
    "="                         {}
    \"                          { BEGIN QUOTED_ATTRIBUTE_VALUE; }
    "'"                         { BEGIN APOSTROPHED_ATTRIBUTE_VALUE; }
}
<QUOTED_ATTRIBUTE_VALUE>{
    [^"]+                       { return(ATTRIBUTE_VALUE); }
    \"                          { BEGIN ATTRIBUTE; }
}
<APOSTROPHED_ATTRIBUTE_VALUE>{
    [^']+                       { return(ATTRIBUTE_VALUE); }
    "'"                         { BEGIN ATTRIBUTE; }
}
<ELEMENT_CONTENT>{
     [[:space:]]+               { BEGIN INITIAL; }
     [^<]+                      { BEGIN INITIAL; return(ELEMENT_VALUE); }
}
%%
int yywrap(){ return 1;}
int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s file.xml\n", argv[0]);
        return 1;
    }
    yyin = fopen(argv[1], "r");
    if (yyin == NULL) {
        perror(argv[1]);
        return 1;
    }
    int tok;
    while ((tok = yylex()) != 0) {
       switch (tok){
          case START_TAG:
             printf("Start Tag = %s\n", yytext);
             break;
          case END_TAG:
             printf("End Tag = %s\n", yytext);
             break;
          case ELEMENT_VALUE:
             printf("Element Value = %s\n", yytext);
             break;
          case ATTRIBUTE_NAME:
             printf("Attribute Name = %s\n", yytext);
             break;
          case ATTRIBUTE_VALUE:
             printf("Attribute Value = %s\n", yytext);
             break;
          case JUNK:
             printf("JUNK = %s\n", yytext);
             break;
          default:
             printf(" = invalid token, value = %s\n", yytext);
       }
    }
    fclose(yyin);
    return 0;
}

Solution

  • Take a look at this program (it's C++, but that language illustrates the problem better):

    #include <iostream>
    
    int main()
    {
        const char text[] = "Hello World!";
        std::cout << text << '\n';
        std::cout << text[1] << '\n';
        std::cout << text + 1 << '\n';
        
        return 0;
    }
    

    It prints:

    Hello World!
    e
    ello World!
    

    See the problem yet? :-)

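    You're passing the second character of the string to printf instead of the address of that character. The %s conversion expects a pointer to a null-terminated string (a char *), so printf tries to use the character's value as a memory address. That is undefined behavior, and in practice it usually crashes the program. Since START_TAG is the first token the lexer returns for your document, the crash most likely happens on the very first printf, which would explain why you see no output at all.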

    The solution is:

    printf("Start Tag = %s\n", yytext + 1);
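
    The same distinction can be sketched in plain C. The helper names below are hypothetical and exist only for illustration; they write into a buffer with snprintf so the result is easy to inspect. The point is that text[1] is a single char and pairs with %c, while text + 1 (equivalently &text[1]) is a char * and pairs with %s:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Hypothetical helper: formats everything after the first character.
       text + 1 is a char *, which is what %s expects. */
    static int format_tail(char *out, size_t n, const char *text)
    {
        assert(out != NULL && text != NULL);
        return snprintf(out, n, "%s", text + 1);
    }

    /* Hypothetical helper: formats only the second character.
       text[1] is a single char, so it needs %c, not %s. */
    static int format_second_char(char *out, size_t n, const char *text)
    {
        assert(out != NULL && text != NULL);
        return snprintf(out, n, "%c", text[1]);
    }
    ```

    Both helpers assume the text is at least one character long, as yytext is for the START_TAG rule.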
    

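    (This is safe only when the matched text is at least one character long; for an empty match, yytext + 1 would point past the terminating null character. The START_TAG rule always matches the leading "<", so it is fine here.)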
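
    If you want that guard in one place, a minimal sketch could look like this. The function name skip_first_char is an assumption made up for this example, not part of flex:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical helper: returns the text after its first character,
       or the text itself when it is empty, so the result never points
       past the terminating '\0'. */
    static const char *skip_first_char(const char *text)
    {
        assert(text != NULL);
        return (text[0] != '\0') ? text + 1 : text;
    }
    ```

    It could then be called as printf("Start Tag = %s\n", skip_first_char(yytext));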

    By the way, I would write a function that also skips any whitespace that follows the "<".
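
    One possible shape for such a function is sketched below. The name tag_name is hypothetical; it returns a pointer into the same buffer and does not copy anything:

    ```c
    #include <assert.h>
    #include <ctype.h>
    #include <stddef.h>

    /* Hypothetical helper: skips a leading '<' (if present) and any
       whitespace after it, returning a pointer into the same buffer. */
    static const char *tag_name(const char *text)
    {
        assert(text != NULL);
        if (*text == '<')
            text++;
        /* Cast to unsigned char: isspace() is undefined for negative values. */
        while (isspace((unsigned char)*text))
            text++;
        return text;
    }
    ```

    It could be used as printf("Start Tag = %s\n", tag_name(yytext));. With the START_TAG pattern as posted, the match cannot contain whitespace after the "<", so the whitespace loop only matters if that rule is loosened later.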