Each nonterminal -- document, messages, message -- in my Bison file returns a string:
%union
{
char *strval;
}
%type <strval> document messages message
My input file contains zero or more messages, with each message followed by a newline (EOL):
messages: message EOL messages { $$ = strcat($1, $3); }
| %empty { $$ = ???; }
;
I don't know what action to use for the %empty part. I tried multiple things. I tried returning the empty string:
{ $$ = ''; }
That resulted in this error message:
message.y: In function 'yyparse':
message.y:25:56: error: empty character constant
25 | | %empty { $$ = ''; }
I tried returning a single space character:
{ $$ = ' '; }
That resulted in this error message:
message.y: In function 'yyparse':
message.y:25:54: warning: assignment to 'char *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
25 | | %empty { $$ = ''; }
I tried casting to a char:
{ $$ = char(' '); }
That resulted in this error message:
message.y: In function 'yyparse':
message.y:26:56: error: expected expression before 'char'
25 | | %empty { $$ = ''; }
Eek! I am out of ideas. What is the correct action to use for an alternative containing a %empty
in a rule that returns a string?
You can return the empty string ("") or NULL.
I'd recommend NULL because I don't like giving string literals the type char*: they tend to point to read-only memory in common compilation strategies (i.e. they are placed in sections mapped into read-only pages at runtime). So the ideal type would be const char*, but that is at odds with the rest of your code. Another strategy would be to heap-allocate an empty string (strdup("")), but this feels wasteful (and it would need to be freed afterwards, even though it's an intermediary whose contents will be copied by the concatenation operation). So I'd go with NULL and some logic to test for it.
You will often see parser action code that explicitly copies tokens' underlying value(s) to the heap.
Your code looks suspicious in that it uses strcat, which writes the result onto the end of its first argument (the destination) and returns that first argument. So you'd need to know that the destination is large enough. The usual fix is to make a fresh heap allocation large enough to store both strings, copy the contents of the first string into it, and then strcat the second onto the end.
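To make the buffer-size point concrete, here is a minimal sketch of that allocate-then-copy approach in plain C. The helper name concat_fresh is my own invention for illustration; it is not part of Bison or the C library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated string holding `first`
 * followed by `second`, or NULL if the allocation fails. Neither input is
 * modified or freed; the caller is responsible for freeing the result. */
char *concat_fresh(const char *first, const char *second)
{
    assert(first != NULL && second != NULL);

    size_t first_len = strlen(first);
    size_t second_len = strlen(second);

    /* +1 leaves room for the terminating '\0'. */
    char *both = malloc(first_len + second_len + 1);
    if (both == NULL) {
        return NULL;
    }

    strcpy(both, first);  /* copy the first string into the new buffer */
    strcat(both, second); /* the buffer is now known to be large enough */
    return both;
}
```

Because the inputs are left untouched, an action using this would still have to free $1 and $3 itself if they were heap-allocated.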
This pattern arises quite often in LR parsing. Consider the following grammar which parses a - potentially empty - list of integers:
L -> ε.
L -> int L.
In C, if you use linked lists, you generally identify the empty list as being NULL. So the actions become (in pseudo-code):
L -> ε { $$ = NULL; }
L -> int L {
struct list* l = malloc(sizeof(struct list));
l->value = $1; // store value
l->next = $2; // store the tail of the list
$$ = l;
}
The list data structure is built up backwards (because LR performs reductions bottom-up). The difference is that your code doesn't store the result of a previous reduction (the non-terminal L on the right-hand side of the L -> int L production) by reference in what it's building; it concatenates it.
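For reference, the list-building actions above can be written as ordinary C along these lines. This is only a sketch: the function names (list_cons, list_first, list_length, list_free) are hypothetical, and struct list is assumed to hold a single int per node.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed node layout; the empty list is represented by NULL. */
struct list {
    int value;
    struct list *next;
};

/* Mirrors the action for L -> int L: prepend `value` to an already
 * built tail. Returns NULL if the allocation fails. */
struct list *list_cons(int value, struct list *tail)
{
    struct list *node = malloc(sizeof *node);
    if (node == NULL) {
        return NULL;
    }
    node->value = value; /* store value */
    node->next = tail;   /* store the tail of the list */
    return node;
}

/* The empty list (NULL) has no first element, so require a node. */
int list_first(const struct list *l)
{
    assert(l != NULL);
    return l->value;
}

/* Counts the nodes; the empty list has length 0. */
size_t list_length(const struct list *l)
{
    size_t n = 0;
    for (; l != NULL; l = l->next) {
        n++;
    }
    return n;
}

/* Frees every node in the list. */
void list_free(struct list *l)
{
    while (l != NULL) {
        struct list *next = l->next;
        free(l);
        l = next;
    }
}
```

Parsing the input 1 2 3 would correspond to list_cons(1, list_cons(2, list_cons(3, NULL))): the innermost call happens first, which is the "built up backwards" behaviour.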
I'd suggest you do something like the following (introducing an assumed definition for message):
// copies string contents to heap
message: IDENT { $$ = strdup($1); }
;
messages: message EOL messages {
        if ($3 == NULL) {
            // no concatenation required
            $$ = $1;
        } else {
            // allocate space for both
            char* both = realloc($1, strlen($1) + strlen($3) + 1);
            $$ = strcat(both, $3);
            free($3);
        }
    }
| %empty {
        $$ = NULL;
    }
;
This should behave the way you expect. It grows the allocation holding $1 (via realloc, which may move it) so that it is large enough for the concatenated string, copies the already-reduced concatenation ($3) onto the end of it, then frees the already-reduced concatenation ($3). Note that the result of realloc is not checked for failure here.
Here's a conceptualisation of the reduction steps:
foo EOL (bar EOL (baz EOL ε))
=>
foo EOL (bar EOL (baz EOL NULL))
=>
foo EOL (bar EOL "baz")
=>
foo EOL "barbaz"
=>
"foobarbaz"
Although the nesting is only made explicit by my use of parentheses, this is pretty close to how the LR stack will appear during parsing (excluding the explicit appearance of the epsilon rule; that reduction, yielding NULL, will be taken if the current lookahead belongs to FOLLOW(messages)).
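If it helps to see those reduction steps outside of Bison, the following sketch replays them in plain C. Both function names are hypothetical: reduce_messages stands in for the messages: message EOL messages action (with an allocation check added), and reduce_all is a made-up driver that applies it right to left, starting from the NULL produced by the %empty alternative.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the `messages: message EOL messages` action.
 * Takes ownership of both heap strings; `rest` may be NULL (empty tail). */
char *reduce_messages(char *message, char *rest)
{
    assert(message != NULL);
    if (rest == NULL) {
        return message; /* no concatenation required */
    }

    size_t message_len = strlen(message);
    size_t rest_len = strlen(rest);

    /* Grow the first string so it can hold both. */
    char *both = realloc(message, message_len + rest_len + 1);
    if (both == NULL) {
        free(message);
        free(rest);
        return NULL;
    }

    memcpy(both + message_len, rest, rest_len + 1); /* includes '\0' */
    free(rest);
    return both;
}

/* Hypothetical driver: performs the reductions innermost-first, i.e. from
 * the last message back to the first. Returns NULL for an empty input
 * (and also on allocation failure). */
char *reduce_all(const char *const *messages, size_t count)
{
    char *result = NULL; /* the %empty reduction */

    for (size_t i = count; i > 0; i--) {
        const char *src = messages[i - 1];

        /* Copy the token text to the heap, as the `message` rule would. */
        char *copy = malloc(strlen(src) + 1);
        if (copy == NULL) {
            free(result);
            return NULL;
        }
        strcpy(copy, src);

        result = reduce_messages(copy, result);
        if (result == NULL) {
            return NULL;
        }
    }
    return result;
}
```

Calling reduce_all on the three strings foo, bar and baz goes through the same intermediate values as the steps above: first baz, then barbaz, then foobarbaz. The caller frees the final string.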