Flex: lexical analyzer to remove multi-line comments in Haskell

I have the following code:

%{
    #include<stdio.h>
%}

%x multicomment

%option noyywrap
%% 

--(.*) ; 
  
"{-"      BEGIN(multicomment);
<multicomment>[^*\n]+    
<multicomment>"*"        
<multicomment>\n         
<multicomment>"-}"    BEGIN(INITIAL);
%% 
  
int main(int argc,char **argv) 
{ 
    yyin=fopen("Code.txt","r"); 
    yyout=fopen("out.c","w");

    yylex(); 
    return 0; 
}

The task is pretty simple: remove single-line and multi-line comments from Haskell code.

Single-line comments start with --; multi-line comments are delimited by {- and -}.

The code above works fine if I use "/*" and "*/" (C comments) instead of "{-" and "-}". When I use the latter two, flex removes every character after {-, and I don't know why.

For example, suppose I have the following input text to clean:

some text

{- some other text
    in multiline
    with haskel comment
-}

/* another text
    always in multiline
    but with C comment
*/

some text without comment

If the rules above are set as follows:

    "/*"      BEGIN(multicomment);
    <multicomment>[^*\n]+    
    <multicomment>"*"        
    <multicomment>\n         
    <multicomment>"*/"    BEGIN(INITIAL);

with "/*" and "*/", the output is right:

some text

{- some other text
    in multiline
    with haskel comment
-}

some text without comment

If I use the original code instead:

    "{-"      BEGIN(multicomment);
    <multicomment>[^*\n]+    
    <multicomment>"*"        
    <multicomment>\n         
    <multicomment>"-}"    BEGIN(INITIAL);

with "{-" and "-}", it doesn't work and the output is:

some text

It deletes all characters from "{-" to the end of the file. I've also tried other setups recommended on other forums, such as:

<multicomment>"-\}"    BEGIN(INITIAL);
<multicomment>"-"+"}"    BEGIN(INITIAL);
<multicomment>"-" + "}"    BEGIN(INITIAL);
<multicomment>[-}]    BEGIN(INITIAL);

But in these cases, when I try to compile with flex CommentClean.l, this is the result:

CommentClean.l:16: warning, rule cannot be matched

Can someone help me? Where am I going wrong, and how can I fix it?

Solution

You’ve only changed the beginning and ending delimiters, but not the rules to match the contents.

The original rules say “in the multicomment state, ignore a run of one or more characters that are neither asterisks nor newlines; ignore a single asterisk; and ignore a newline”. An asterisk followed by a slash is still matched as the ending delimiter, because flex prefers the longest match and "*/" (two characters) beats "*" (one character).

    <multicomment>[^*\n]+    
    <multicomment>"*"        
    <multicomment>\n

What happened in your code when you only changed the delimiters is that {- began a comment, and then the closing delimiter -} was consumed as part of the contents, “a series of non-asterisk/newline characters”. That rule wins because it usually matches a (much!) longer string; even when -} stands alone at the start of a line and both rules match two characters, flex breaks the tie in favour of the rule listed first.
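
To make the effect concrete, the content pattern can be mimicked in plain C. This is only a sketch: content_match_len is a hypothetical helper standing in for the flex pattern [^X\n]+ (where X is the excluded character), not something flex provides, and flex itself matches with a generated automaton rather than a loop like this.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the longest prefix of s that contains neither `stop` nor a
   newline. Hypothetical stand-in for the flex pattern [^X\n]+ with
   X = stop; for illustration only. */
size_t content_match_len(const char *s, char stop)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0' && s[n] != '\n' && s[n] != stop)
        n++;
    return n;
}
```

For the comment body " text -}", calling content_match_len with '*' as the excluded character covers all eight characters, closing delimiter included, so a rule for "-}" never gets a chance to fire. A content rule only stops in time if it excludes a character that the closing delimiter starts with.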

I think you just need to change the asterisks to hyphens:

    <multicomment>[^-\n]+    
    <multicomment>"-"        
    <multicomment>\n

However, note that this doesn’t account for the fact that in Haskell, unlike in C, multi-line comments may be nested like so:

{-

a multi-line comment

  {-
    containing another comment

    {- containing yet another comment -}

  -}

-}

So to be strictly correct, you should also handle nested multi-line comments, for example by counting the nesting depth while in the multicomment state, since a regular expression alone cannot match them recursively. Also bear in mind that -- only starts a single-line comment if it is not part of an operator, so for example --> and |-- are valid operators, not the start of a comment. (And yes, people use these in real code!)
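
One common way to deal with nesting is a depth counter: every {- increments it, every -} decrements it, and text is kept only while the counter is zero. The plain C sketch below shows the idea outside flex. The name strip_block_comments is hypothetical, and the function deliberately ignores string literals, line comments and pragmas, all of which a real Haskell lexer would have to handle.

```c
#include <assert.h>
#include <stddef.h>

/* Copies src to dst while dropping Haskell block comments {- ... -},
   including nested ones, and returns the number of characters written.
   dst must have room for strlen(src) + 1 bytes.
   Hypothetical helper; string literals and -- comments are not handled. */
size_t strip_block_comments(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    int depth = 0;                      /* current nesting level */
    size_t i = 0, o = 0;
    while (src[i] != '\0') {
        if (src[i] == '{' && src[i + 1] == '-') {
            depth++;                    /* opener, at any level */
            i += 2;
        } else if (depth > 0 && src[i] == '-' && src[i + 1] == '}') {
            depth--;                    /* closes the innermost open comment */
            i += 2;
        } else {
            if (depth == 0)
                dst[o++] = src[i];      /* outside all comments: keep */
            i++;
        }
    }
    dst[o] = '\0';
    return o;
}
```

In a flex scanner the same counter could live in the actions: the "{-" rules increment it, and the "-}" rule in the multicomment state decrements it and only switches back to INITIAL when it reaches zero. The content rule then has to stop at both - and {, so that nested openers are seen.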

You can find the specification for comments in the Haskell Report §2.3. It says that a symbol is:

Any one of these characters (ascSymbol, where the last one is the colon): ! # $ % & * + . / < = > ? @ \ ^ | - ~ : or
Any Unicode character with the properties Symbol (S) or Punctuation (P) (uniSymbol), except for ( ) , ; [ ] ` { } (special) and _ " '.
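
As a rough illustration of the operator caveat, here is a C sketch that decides whether a run of dashes starts a line comment. Both function names are hypothetical, and it assumes ASCII input only: the Unicode part of the definition (uniSymbol) is ignored, and the star in the list above is taken to be the ASCII asterisk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if c is one of the ASCII symbol characters (ascSymbol).
   Assumption: Unicode symbols (uniSymbol) are not considered. */
bool is_asc_symbol(char c)
{
    return c != '\0' && strchr("!#$%&*+./<=>?@\\^|-~:", c) != NULL;
}

/* True if a line comment starts at line[pos]: a run of two or more dashes
   with no symbol character directly before or after it. pos is expected
   to point at the first dash of the run. */
bool starts_line_comment(const char *line, size_t pos)
{
    assert(line != NULL && pos <= strlen(line));
    if (line[pos] != '-' || line[pos + 1] != '-')
        return false;
    if (pos > 0 && is_asc_symbol(line[pos - 1]))
        return false;                   /* part of an operator such as |-- */
    size_t end = pos;
    while (line[end] == '-')
        end++;                          /* skip the whole run of dashes */
    return !is_asc_symbol(line[end]);   /* --> is an operator, not a comment */
}
```

With this check, "-- hi" at position 0 counts as a comment, while the dashes in "x --> y" and "a |-- b" do not. A flex rule like --(.*) cannot see the preceding character on its own, so a real scanner would need either a broader operator rule that wins by longest match or some extra context.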