I have the following code:
%{
#include<stdio.h>
%}
%x multicomment
%option noyywrap
%%
--(.*) ;
"{-" BEGIN(multicomment);
<multicomment>[^*\n]+
<multicomment>"*"
<multicomment>\n
<multicomment>"-}" BEGIN(INITIAL);
%%
int main(int argc,char **argv)
{
yyin=fopen("Code.txt","r");
yyout=fopen("out.c","w");
yylex();
return 0;
}
The task is simple: remove single-line and multi-line comments from Haskell code.
-- starts a single-line comment; {- and -} delimit a multi-line comment.
The code above works fine if I use "/*" and "*/" (C comments) instead of "{-" and "-}". When I use the latter two, flex removes every character after {- and I don't know why.
For example, suppose this is the input text to clean:
some text {- some other text in multiline with haskel comment -} /* another text always in multiline but with C comment */ some text without comment
If the code above is set up as follows:
"/*" BEGIN(multicomment);
<multicomment>[^*\n]+
<multicomment>"*"
<multicomment>\n
<multicomment>"*/" BEGIN(INITIAL);
with "/*" and "*/", the output is right:
some text {- some other text in multiline with haskel comment initiator -} some text without comment
If instead I use the original code
"{-" BEGIN(multicomment);
<multicomment>[^*\n]+
<multicomment>"*"
<multicomment>\n
<multicomment>"-}" BEGIN(INITIAL);
with "{-" and "-}", it doesn't work and the output is:
some text
It deletes every character from "{-" to the end of the file. I've also tried other setups recommended on other forums, such as:
<multicomment>"-\}" BEGIN(INITIAL);
<multicomment>"-"+"}" BEGIN(INITIAL);
<multicomment>"-" + "}" BEGIN(INITIAL);
<multicomment>[-}] BEGIN(INITIAL);
But in these cases, when I run flex CommentClean.l, this is the result:
CommentClean.l:16: warning, rule cannot be matched
Can someone help me? Where am I going wrong, and how can I fix it?
You’ve only changed the beginning and ending delimiters, but not the rules that match the contents.
The original rules say “in the multicomment state, ignore a run of one or more characters that are neither asterisks nor newlines; ignore a single asterisk; and ignore a newline”. An asterisk followed by a slash is matched as the ending delimiter by the longest-match rule, because the two-character "*/" is longer than the single "*".
<multicomment>[^*\n]+
<multicomment>"*"
<multicomment>\n
What was happening in your code when you only changed the delimiters is that {- would begin a comment, and then the closing delimiter -} would be consumed as part of the contents, “a series of non-asterisk/newline characters”, which wins because it matches a (much!) longer string.
I think you just need to change the asterisks to hyphens:
<multicomment>[^-\n]+
<multicomment>"-"
<multicomment>\n
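Put together with the delimiters, the whole scanner might look like the sketch below. It keeps the file names from the question; the fopen checks and the fclose calls are my additions, not part of the original code.

```lex
%{
#include <stdio.h>
%}
%x multicomment
%option noyywrap
%%
"--".*                  ;   /* single-line comment: discard to end of line */
"{-"                    BEGIN(multicomment);
<multicomment>[^-\n]+   ;   /* a run that cannot contain "-}" */
<multicomment>"-"       ;   /* a lone hyphen inside the comment */
<multicomment>\n        ;
<multicomment>"-}"      BEGIN(INITIAL);   /* two characters, so it beats "-" */
%%
int main(void)
{
    /* File names taken from the question. */
    yyin = fopen("Code.txt", "r");
    yyout = fopen("out.c", "w");
    if (yyin == NULL || yyout == NULL) {
        perror("fopen");
        return 1;
    }
    yylex();
    fclose(yyin);
    fclose(yyout);
    return 0;
}
```

The run rule now stops at every hyphen, so the scanner gets a chance to try "-}" at that position, and the longest-match rule picks it over the one-character "-".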
However, note that this doesn’t account for the fact that in Haskell, unlike in C, multi-line comments may be nested like so:
{-
a multi-line comment
{-
containing another comment
{- containing yet another comment -}
-}
-}
So to be strictly correct, you should also include a rule that matches multi-line comments recursively. Also bear in mind that -- is only a single-line comment if not part of an operator, so for example --> and |-- are valid operators, not the start of a comment. (And yes, people use these in real code!)
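For the nesting, one option is to count the depth by hand. This is a minimal sketch, not a tested solution: the depth variable is a name I made up, and %option main only asks flex to generate a main() that runs the scanner from stdin to stdout (swap in the main from the question if you need the files). It does not address the operator issue with -- at all.

```lex
%{
static int depth = 0;   /* how many {- are currently open */
%}
%option main
%x multicomment
%%
"--".*                  ;                     /* single-line comment */
"{-"                    { depth = 1; BEGIN(multicomment); }
<multicomment>"{-"      { depth++; }          /* a nested comment opens */
<multicomment>"-}"      { if (--depth == 0) BEGIN(INITIAL); }
<multicomment>[^-{\n]+  ;                     /* cannot start either delimiter */
<multicomment>"-"|"{"   ;                     /* lone hyphen or brace */
<multicomment>\n        ;
```

The run rule excludes both - and {, so each delimiter is tried at its own position and the two-character match wins over the one-character fallback. The scanner only returns to INITIAL when the outermost -} is reached.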
You can find the specification for comments in the Haskell Report §2.3. It says that a symbol is:
Any one of these characters (ascSymbol): ! # $ % & ⋆ + . / < = > ? @ \ ^ | - ~ : or
Any Unicode character with the properties Symbol (S) or Punctuation (P) (uniSymbol), except for ( ) , ; [ ] ` { } (special) and _ " '
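To tell a -- comment apart from an operator, one approach is to let a rule for runs of symbol characters compete with the comment rule. The sketch below is an assumption-laden illustration: SYM and NOTSYM are names I made up, only the ASCII symbols are covered (with the report's ⋆ written as *), and string literals and {- -} comments are ignored, so it would need merging with the multi-line rules.

```lex
%option main
SYM     [!#$%&*+./<=>?@\\^|~:-]
NOTSYM  [^!#$%&*+./<=>?@\\^|~:\n-]
%%
"--"-*({NOTSYM}.*)?     ;       /* only dashes, then a non-symbol: a comment */
{SYM}+                  ECHO;   /* any other run of symbols is an operator */
```

The idea is that for --> or |-- the {SYM}+ rule matches a longer string than the comment rule can, so the operator is echoed unchanged, while for -- followed by a space or a letter the comment rule reaches the end of the line and wins. When a line ends right after the dashes both rules match the same length, and flex then prefers the rule listed first, which is the comment rule.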