Why does my parser misread one part when I change another part?


I have these tokens defined in my lex file:

(?xi:
    ADC|AND|ASL|BIT|BRK|CLC|CLD|CLI|CLV|CMP|CPX|
    DEY|EOR|INC|INX|INY|JMP|JSR|LDA|LDX|LDY|LSR|
    NOP|ORA|PHA|PHP|PLA|PLP|ROL|ROR|RTI|RTS|SBC|
    SEC|SED|SEI|STA|STX|STY|TAX|TAY|TSX|TXA|TXS|
    TYA|CPY|DEC|DEX
) {
    yylval.str = strdup(yytext);
    for(char *ptr = yylval.str; *ptr = tolower(*ptr); ptr++);

    return MNEMONIC;
}

[\(\)=Aa#XxYy,:\+\-\<\>] {
    return *yytext;
}

\$[0-9a-fA-F]{4} {
    yylval.str = strdup(yytext);
    return ABSOLUTE;
}

\$[0-9a-fA-F]{2} {
    yylval.str = strdup(yytext);
    return ZEROPAGE;
}

and this is how I parse them in bison:

struct addr_offset {
    char *str;
    int offset;
};
%union {
    char *str;
    int number;
    struct addr_offset *ao;
}

%type<str> MNEMONIC
%type<str> ABSOLUTE
%type<ao> zp
%token ZEROPAGE

expression:
    MNEMONIC                                { statement(0,  $1, NULL,   "i"); }
|   MNEMONIC zp                             { statement(5,  $1, $2,     }
;

zp:
    ZEROPAGE { $$->str = strdup($1); }
|   '>' ABSOLUTE { $$->str = strdup($2); }
|   '<' ABSOLUTE { $$->str = strdup($2); }
;

The weird thing is that if I add the last two alternatives to the zp rule, MNEMONIC is no longer read correctly in the expression rule.


Solution

  • If you don't set $$ in a rule, bison will by default initialize it with the value of $1. If that is a different %type than $$ is expecting, bad things will happen.

    In the case you are describing, $1 is the value associated with the < or > token. Since the lexer code doesn't set yylval for those tokens, that value is whatever was left over from the previous token, in this case the pointer to the string allocated with strdup for MNEMONIC. So when you assign to $$->str, the string's memory is treated as if it were the data structure in question, and the first 4 or 8 bytes of the string are overwritten with the pointer to the other string being assigned there.

    So the likely result will be some heap corruption which will manifest as bad/corrupted opcodes when you go to look at them.
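
    A minimal C sketch of why this goes wrong, outside bison. The union below is only a stand-in for the question's %union, and the names semantic_value, mnemonic_alloc_size and ao_aliases_string are made up for illustration. It reads the aliased pointer but never writes through it, and it assumes a mainstream platform where all object pointers share one representation.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    struct addr_offset {
        char *str;
        int offset;
    };

    /* Hypothetical stand-in for the %union in the question. */
    union semantic_value {
        char *str;
        int number;
        struct addr_offset *ao;
    };

    /* Bytes the lexer allocates for a mnemonic: its characters plus the NUL. */
    size_t mnemonic_alloc_size(const char *mnemonic) {
        return strlen(mnemonic) + 1;
    }

    /* Stores a copy of the mnemonic in the union the way the lexer does, then
     * reads the same slot back as an ao pointer, which is what $$ amounts to
     * after the default $$ = $1. Returns 1 if both views hold the same address. */
    int ao_aliases_string(const char *mnemonic) {
        union semantic_value v;
        size_t n = mnemonic_alloc_size(mnemonic);
        int same;

        v.str = malloc(n);
        assert(v.str != NULL);
        memcpy(v.str, mnemonic, n);

        /* Read only; writing v.ao->str here would clobber the string. */
        same = ((void *)v.ao == (void *)v.str);
        free(v.str);
        return same;
    }
    ```

    For "lda", mnemonic_alloc_size returns 4, so on a 64-bit build the 8-byte pointer store done by $$->str = ... both clobbers the mnemonic text and runs past the end of its allocation.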


    So with the addition of the %union/%type declarations, we can see what is happening: you're allocating a string and then treating the string's memory as a struct addr_offset, which causes heap corruption and undefined behavior.

    You need your actions that return a struct addr_offset to actually allocate one:

    zp:
        ZEROPAGE { $$ = malloc(sizeof(struct addr_offset)); $$->str = $1; }
    |   '>' ABSOLUTE { $$ = malloc(sizeof(struct addr_offset)); $$->str = $2; }
    |   '<' ABSOLUTE { $$ = malloc(sizeof(struct addr_offset)); $$->str = $2; }
    ;
    

    Note that you don't need a strdup here, as the string has already been allocated in the lexer code, and you're just transferring ownership of that string from the token to the new struct addr_offset you're creating.

    You might want to encapsulate the creation of the struct addr_offset object in a function:

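    struct addr_offset *new_ao(char *addr) {
        struct addr_offset *rv = malloc(sizeof(struct addr_offset));
        if (!rv) return NULL;
        rv->str = addr;  /* takes ownership of the lexer's string */
        /* skip the leading '$' so strtol sees only the hex digits */
        rv->offset = strtol(addr + 1, NULL, 16);
        return rv;
    }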
    

    Then your actions just become, e.g., { $$ = new_ao($1); }
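
    Since the struct takes ownership of the lexer's string, whatever eventually consumes the value has to release both allocations. A self-contained sketch of that lifecycle follows; free_ao is a hypothetical helper that is not part of the question's code, and the constructor is repeated here (skipping a leading '$' before the hex digits) only so the example compiles on its own.

    ```c
    #include <assert.h>
    #include <stdlib.h>

    struct addr_offset {
        char *str;
        int offset;
    };

    /* Takes ownership of addr, a heap string such as "$00ff" from the lexer. */
    struct addr_offset *new_ao(char *addr) {
        struct addr_offset *rv;

        assert(addr != NULL);
        rv = malloc(sizeof *rv);
        if (rv == NULL)
            return NULL;
        rv->str = addr;
        /* Skip a leading '$' so strtol only sees hex digits. */
        rv->offset = (int)strtol(addr[0] == '$' ? addr + 1 : addr, NULL, 16);
        return rv;
    }

    /* Hypothetical cleanup helper: frees the owned string, then the struct. */
    void free_ao(struct addr_offset *ao) {
        if (ao == NULL)
            return;
        free(ao->str);
        free(ao);
    }
    ```

    One plausible place to call free_ao is the action that consumes the zp value once statement() is done with it; bison's %destructor directive for the <ao> tag is another option for values discarded during error recovery.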