So I have this lex file:
%{
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include "node.h"
#include "y.tab.h"
char *dupstr(const char *s);
void yyerror(char *s);
int octal(char *s);
%}
%%
\$\$.* ; /* comment */
\$(.|\n)*\$ ; /* comment */
">=" return GE;
"<=" return LE;
":=" return AT;
"~=" return NEQ;
"if" return IF;
"else" return ELSE;
"then" return THEN;
"elif" return ELIF;
"fi" return FI;
"for" return FOR;
"until" return UNTIL;
"step" return STEP;
"do" return DO;
"done" return DONE;
"repeat" return REP;
"stop" return STOP;
"return" return RET;
^"program" return PROG;
^"module" return MOD;
"start" return ST;
^"end" return END;
"void" return VD;
"const" return CT;
"number" return NB;
"array" return ARR;
"string" return SG;
"function" return FC;
"public" return PB;
"forward" return FW;
0|[1-9][0-9]* { errno = 0; yylval.i = strtol(yytext, 0, 10); if (errno == ERANGE)
yyerror("overflow in decimal constant"); return INTEGER; }
0[0-7]+ { yylval.i = octal(yytext); return INTEGER; }
0x[0-9a-fA-F]+ { yylval.i = strtol(yytext, 0, 16); return INTEGER; }
0b[01]+ { errno = 0; yylval.i = strtol(yytext+2, 0, 2); if (errno == ERANGE)
yyerror("overflow in binary constant"); return INTEGER; }
\'[^\\\']\'|\'\\[nrt\\\']\'|\'\\[a-fA-F0-9]\' { yytext[yyleng-1] = 0; yylval.s =
dupstr(yytext+1); return STRING; }
[A-Za-z][A-Za-z0-9_]* { yylval.s = dupstr(yytext+1); return ID; }
\"[^"]*\" { yytext[yyleng-1] = 0; yylval.s = dupstr(yytext+1); return STRING; }
[-+*/%^:=<>~|&?#<\[\]();!,] return *yytext;
[ \t\n\r]+ ; /* ignore whitespace */
. yyerror("Unknown character");
%%
char *getyytext() { return yytext; }
int yywrap(void) {
return 1;
}
int octal(char *s)
{
int i, a = 0, b = 0;
for (i = 0; i < strlen(s); i++) {
if (s[i] < '0' || s[i] > '7') break;
b = b * 8 + s[i] - '0';
if (b < a) {
yyerror("octal overflow");
break;
}
a = b;
}
return a;
}
And I want a restriction that lets me write anything I want, but only before the tokens program and module or after the token end. Is that possible? I tried some options in the corresponding yacc file but could not get it to work, and I think this is really an issue for lex. Sorry in advance: it's my first time working with this language, and I did not find anything in my research that could help with this problem.
You will need a start condition for that, but it's quite a simple application. Each start condition applies to a different lexical environment. In your case, you basically have two such environments: one corresponding to text which shouldn't be parsed, and the other corresponding to the parts of the text which you want to analyse.
This is often called "island parsing", because you are attempting to parse an island of structured information in a sea of unstructured text.
Lex-based scanner generators have a default start condition called <INITIAL>, which is the one active when the lexer first starts up. Rules in <INITIAL> don't have to be written with an explicit start condition; other rules do. That's quite irritating in the case of island parsing because most rules are in the island start condition, which means that the condition name has to be prepended to all of them.
But you are almost certainly actually using flex, and if so you can use the useful flex extension which allows a block of rules to be assigned to a start condition. That's how I've written this answer, and if it works for you then you should change any build rules which refer to "lex" so that they correctly name the scanner generator you are using (since if you use flex extensions, you will need to process the file with flex).
Correctly writing a parser requires a great deal of precision in the specification of the input. There are a number of unspecified cases in your brief question; I start by listing the ones I saw, and the resolution I chose (which was usually the least-effort resolution).
In the outer <INITIAL> start condition, any line of text which doesn't start precisely with the words program or module is unstructured text. Your question doesn't indicate how you want this to be handled. You could pass it through to the parser, ignore it, copy it to yyout, or any number of other alternatives. Here, I'm ignoring it, since that's the simplest. It should be clear what needs to be changed for the other alternatives.
Does the word program or module have to be the only thing on the line for it to be recognised? If not, what can follow it? Would, for example, this line qualify:
program"FOO"{
(I have no idea what the grammar of your language is; I'm just raising hypotheticals here.) The simplest solution would be to require the word to be by itself on a line, but that's not a very likely requirement: we often want to put things like comments on the same line as such tokens. On the other hand, it would be very surprising if the line
programming is complicated because we're not used to thinking precisely
were to be treated as the start of a parsed block. So I've made the guess that what counts are lines where program
(or module) are precisely at the beginning of the line, immediately followed by whitespace (or by the end of the line, which is also a whitespace character). That would fail to recognise either of the following:
program$$ This is a comment
program;
But it will recognise
program $$ This is a comment
program MyProgram
So some adjustments may need to be made, depending on your needs.
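To make that guess concrete, here is a plain C model of which lines would open an island under it. This is only a sketch of the acceptance rule, not the scanner itself; the names starts_with_keyword and starts_island are hypothetical, and each line is assumed to be passed in with its trailing newline, as the scanner would see it.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: does `line` begin with `keyword`, immediately
 * followed by a whitespace character? */
static bool starts_with_keyword(const char *line, const char *keyword)
{
    size_t n = strlen(keyword);
    if (strncmp(line, keyword, n) != 0)
        return false;
    /* The character after the keyword must be whitespace. The terminating
     * '\0' (input that ends without a newline) does not count. */
    return line[n] != '\0' && isspace((unsigned char)line[n]);
}

/* Hypothetical helper: true if `line` would open an island. */
bool starts_island(const char *line)
{
    assert(line != NULL);
    return starts_with_keyword(line, "program")
        || starts_with_keyword(line, "module");
}
```

Under this model, "program MyProgram" and a bare "program" line are accepted, while "program;", "program$$ ..." and "programming is ..." are not. If you want punctuation or a comment to be allowed directly after the keyword, the whitespace test is the part to change.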
I also had doubts about the precise handling of the text following the island. Do you expect only a single island? Or could you have:
unstructured text
unstructured text
program ...
end
unstructured text
module ...
end
unstructured text
The following assumes that you will want to handle both islands, again because it is the easiest. If, instead, you want to ignore all text which follows the end, you will need to add a third start condition which just ignores all text. (Alternatively, if you don't want to do anything with the text which follows the island, you could just reset the input stream after reading the end token.)
Is it really necessary for the end token to be at the beginning of a line, once a program or module keyword has been encountered? If you require that, then an incorrectly or inadvertently indented end will be converted into an ID by your scanner. This seems to me unlikely, so I left out the restriction. I'm also operating under the assumption that a line which starts with end in unstructured text is still unstructured text; that is, there is no need for the <INITIAL> rules to even attempt to detect it.
Similarly, it is not clear to me whether program and module are legal tokens inside the island, or whether they should be treated as identifiers. If they are legal tokens, is there any good reason to restrict them to appearing at the beginning of a line? I think not, so I left out the restriction.
That said, here's a sample implementation. We start by declaring the start condition (you can read the flex documentation linked for a detailed explanation of why I used %x to declare it), which must go into the first section of the flex input, before the %%:
%x ISLAND
%%
In the <INITIAL> state, we are only concerned with lines which start with program or module. As indicated above, we also need to ensure that the target words are followed by whitespace. That's actually a little bit tricky, because negative matches ("lines which don't start with program or module") are very difficult to write as regular expressions (without negative lookahead assertions, which (f)lex doesn't provide). Instead of trying to do that, we separately recognise the first word in the line and the rest of the line, which allows us to make use of the longest-match rule. But first, we need to recognise our special cases, which switch start condition using the BEGIN special action. Here we use flex's "trailing context" operator / to ensure that the keyword is followed by whitespace:
^program/[[:space:]] { BEGIN(ISLAND); return PROG; }
^module/[[:space:]] { BEGIN(ISLAND); return MOD; }
[[:alpha:]]+ ; /* Any other word (at the beginning of a line) */
[^[:alpha:]\n].* ; /* See below */
\n ; /* The newline at the end of the line */
The third rule matches an alphabetic word at the beginning of a line. [Note 1]
The fourth rule matches both the rest of the line after a word and any line which doesn't start with a word. We have to be careful not to match a \n at the beginning of a line; without the exclusion of \n in the negative character class, the pattern would match the \n of an empty line and then the entire next line, so it would skip over program in the case that it followed a blank line. (If that wasn't clear, you might want to experiment.)
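If you would rather experiment without rebuilding the scanner, the following C sketch imitates the longest-match choice between the five <INITIAL> rules. It is a hand-written model under my reading of those rules, not flex output; find_island and the match_* helpers are hypothetical names. Passing false for exclude_newline models the fourth rule written without \n in the character class.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Rules 1 and 2: keyword at the start of a line plus one whitespace
 * character of trailing context, which counts toward the match length. */
static size_t match_keyword(const char *p, bool at_bol, const char *kw)
{
    size_t n = strlen(kw);
    if (at_bol && strncmp(p, kw, n) == 0 && p[n] != '\0'
        && isspace((unsigned char)p[n]))
        return n + 1;
    return 0;
}

/* Rule 3: [[:alpha:]]+ */
static size_t match_word(const char *p)
{
    size_t n = 0;
    while (isalpha((unsigned char)p[n]))
        n++;
    return n;
}

/* Rule 4: [^[:alpha:]\n].* or, if exclude_newline is false, the variant
 * [^[:alpha:]].* that the text warns about. */
static size_t match_rest(const char *p, bool exclude_newline)
{
    if (*p == '\0' || isalpha((unsigned char)*p))
        return 0;
    if (exclude_newline && *p == '\n')
        return 0;
    size_t n = 1;
    while (p[n] != '\0' && p[n] != '\n')   /* '.' never matches '\n' */
        n++;
    return n;
}

/* Returns the offset of the keyword that opens the island, or -1. */
long find_island(const char *text, bool exclude_newline)
{
    assert(text != NULL);
    size_t pos = 0;
    while (text[pos] != '\0') {
        const char *p = text + pos;
        bool at_bol = (pos == 0 || text[pos - 1] == '\n');
        size_t kw = match_keyword(p, at_bol, "program");
        size_t kw2 = match_keyword(p, at_bol, "module");
        if (kw2 > kw)
            kw = kw2;
        size_t word = match_word(p);
        size_t rest = match_rest(p, exclude_newline);
        size_t nl = (*p == '\n') ? 1 : 0;      /* Rule 5: \n */
        /* Longest match wins; the keyword rules come first, so they
         * also win ties. */
        if (kw > 0 && kw >= word && kw >= rest && kw >= nl)
            return (long)pos;
        size_t best = word;
        if (rest > best)
            best = rest;
        if (nl > best)
            best = nl;
        pos += best;   /* always at least 1 for a non-NUL character */
    }
    return -1;
}
```

For the input "intro\n\nprogram x\n", the model finds the keyword when exclude_newline is true and misses it when it is false, because the blank line's \n then swallows the whole following line.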
The <ISLAND> start condition is essentially the rules you've already written, wrapped inside of a start condition block. For that reason, I didn't repeat all the rules; only the ones I changed. Note that inside a start condition block, flex lifts the restriction that rules must start at the beginning of a line. Also note that there is no need to quote patterns consisting only of letters and digits. Only patterns with metacharacters need to be quoted.
<ISLAND>{ /* Open the block */
[[:space:]]+ ; /* Ignore whitespace */
end { BEGIN(INITIAL); return END; }
program { return PROG; }
module { return MOD; }
/* And all the rest of the rules. */
}
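The way the two start conditions hand control back and forth can also be modelled in plain C. This is a deliberately simplified sketch: count_islands and the OUTSIDE/INSIDE states are hypothetical stand-ins for <INITIAL> and <ISLAND>, and the model ignores comments and string literals, so an "end" inside either would wrongly close the island here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum state { OUTSIDE, INSIDE };   /* stand-ins for INITIAL and ISLAND */

/* Counts completed islands (keyword ... end) in `text`. */
int count_islands(const char *text)
{
    assert(text != NULL);
    enum state st = OUTSIDE;
    int islands = 0;
    size_t pos = 0;

    while (text[pos] != '\0') {
        if (st == OUTSIDE) {
            bool at_bol = (pos == 0 || text[pos - 1] == '\n');
            size_t n = 0;
            if (at_bol) {
                if (strncmp(text + pos, "program", 7) == 0)
                    n = 7;
                else if (strncmp(text + pos, "module", 6) == 0)
                    n = 6;
            }
            if (n > 0 && text[pos + n] != '\0'
                && isspace((unsigned char)text[pos + n])) {
                st = INSIDE;              /* models BEGIN(ISLAND) */
                pos += n;
            } else {
                /* Unstructured text: skip the rest of the line. */
                while (text[pos] != '\0' && text[pos] != '\n')
                    pos++;
                if (text[pos] == '\n')
                    pos++;
            }
        } else if (isalpha((unsigned char)text[pos])) {
            /* Inside the island: take the longest identifier-like word,
             * so "ending" is not mistaken for "end". */
            size_t len = 1;
            while (isalnum((unsigned char)text[pos + len])
                   || text[pos + len] == '_')
                len++;
            if (len == 3 && strncmp(text + pos, "end", 3) == 0) {
                st = OUTSIDE;             /* models BEGIN(INITIAL) */
                islands++;
            }
            pos += len;
        } else {
            pos++;                        /* any other island character */
        }
    }
    return islands;
}
```

Because the end keyword simply switches back to the outer state, a later program or module line opens a second island, which matches the assumption that more than one island per file should be handled.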
Note 1: In theory, the third rule could match an alphabetic word anywhere, since it isn't anchored with ^. In practice, it's impossible to trigger this rule other than at the beginning of a line because the fourth rule always extends to the end of a line. But in theory, some action could call BEGIN(INITIAL) at a moment in which the next character to read is alphabetic and not at the beginning of a line. Careful examination of the code will show this to be impossible, but flex can't do that sort of analysis; from flex's perspective, that is a possibility, and if it happens then the third rule will be required.
I know this because I always use %option nodefault in my flex files, which causes flex to warn me if there is any possibility that no rule will apply to an input. And since I initially wrote rule 3 with an anchor, flex obliged by warning me that it was possible to match the default rule. So I had to remove the anchor in order to remove that warning. But despite the annoyance, I regard that warning as useful, because it is certainly possible that at some point in the future, someone might introduce a BEGIN action which creates the condition under which an unanchored match of an alphabetic word would be necessary.