
Flex: lexical analyzer to remove multi-line comments in Haskell


I have the following code:

%{
    #include<stdio.h>
%}

%x multicomment

%option noyywrap
%% 

--(.*) ; 
  
"{-"      BEGIN(multicomment);
<multicomment>[^*\n]+    
<multicomment>"*"        
<multicomment>\n         
<multicomment>"-}"    BEGIN(INITIAL);
%% 
  
int main(int argc,char **argv) 
{ 
    yyin=fopen("Code.txt","r"); 
    yyout=fopen("out.c","w");

    yylex(); 
    return 0; 
} 

The task is pretty simple: remove single-line and multi-line comments from Haskell code.

Haskell uses -- for single-line comments and {- -} for multi-line comments.

The code above works fine if I use "/*" and "*/" (C comments) instead of "{-" and "-}". When I use the latter two, flex removes every character after {-, and I don't know why.

For example, suppose the following input text needs to be cleaned:

some text

{- some other text
    in multiline
    with haskel comment
-}

/* another text
    always in multiline
    but with C comment
*/

some text without comment

If the rules above are set as follows:

    "/*"      BEGIN(multicomment);
    <multicomment>[^*\n]+    
    <multicomment>"*"        
    <multicomment>\n         
    <multicomment>"*/"    BEGIN(INITIAL);

with "/*" & "*/", the output is right:

some text

{- some other text
    in multiline
    with haskel comment initiator
-}

some text without comment

If instead I use the original code

    "{-"      BEGIN(multicomment);
    <multicomment>[^*\n]+    
    <multicomment>"*"        
    <multicomment>\n         
    <multicomment>"-}"    BEGIN(INITIAL);

with "{-" & "-}", it doesn't work and the output is:

some text

It deletes all characters from "{-" to the end of the file. I've also tried other setups recommended on other forums, such as:

<multicomment>"-\}"    BEGIN(INITIAL);
<multicomment>"-"+"}"    BEGIN(INITIAL);
<multicomment>"-" + "}"    BEGIN(INITIAL);
<multicomment>[-}]    BEGIN(INITIAL);

But in these cases, when I try to compile with flex CommentClean.l, this is the result:

CommentClean.l:16: warning, rule cannot be matched

Can someone help me? Where am I going wrong, and how can I fix it?


Solution

  • You’ve only changed the beginning and ending delimiters, but not the rules to match the contents.

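    The original rules say “in the multicomment state, ignore a run of one or more characters that are neither asterisks nor newlines; ignore a single asterisk; and ignore a newline”. An asterisk followed by a slash is still matched as the ending delimiter, because "*/" is two characters long and the longest-match rule prefers it over the one-character "*" rule.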

        <multicomment>[^*\n]+    
        <multicomment>"*"        
        <multicomment>\n 
    

    What was happening in your code when you only changed the delimiters is that {- would begin a comment, and then the closing delimiter -} would be consumed as part of the contents, “a series of non-asterisk/newline characters”, which will win because it matches a (much!) longer string.

    I think you just need to change the asterisks to hyphens:

        <multicomment>[^-\n]+    
        <multicomment>"-"        
        <multicomment>\n 
    

    However, note that this doesn’t account for the fact that in Haskell, unlike in C, multi-line comments may be nested like so:

    {-
    
    a multi-line comment
    
      {-
        containing another comment
    
        {- containing yet another comment -}
    
      -}
    
    -}
    

    So to be strictly correct, you should also include a rule that matches multi-line comments recursively. Also bear in mind that -- is only a single-line comment if not part of an operator, so for example --> and |-- are valid operators, not the start of a comment. (And yes, people use these in real code!)
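The nesting requirement comes down to keeping a depth counter: increment it on every {- and decrement it on every -}, and only leave the comment when it returns to zero. In a Flex specification that counter would live in the actions of the multicomment rules. The sketch below shows the same logic in plain C so it can be run on its own. The function name strip_block_comments is hypothetical (it is not from the original post), and the sketch assumes ASCII input and deliberately ignores string literals, -- line comments and {-# pragmas.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the original post): copies `src` to `dst`
 * while dropping Haskell block comments, tracking nesting with a counter.
 * `dst` must have room for at least strlen(src) + 1 bytes.
 * Returns the final depth: 0 means every "{-" was closed. */
int strip_block_comments(const char *src, char *dst)
{
    int depth = 0;      /* number of "{-" openers currently unclosed */
    size_t out = 0;     /* next write position in dst */

    for (size_t i = 0; src[i] != '\0'; ) {
        if (src[i] == '{' && src[i + 1] == '-') {
            depth++;                    /* enter a comment or nest deeper */
            i += 2;
        } else if (depth > 0 && src[i] == '-' && src[i + 1] == '}') {
            depth--;                    /* close the innermost comment */
            i += 2;
        } else {
            if (depth == 0)
                dst[out++] = src[i];    /* outside comments: keep the byte */
            i++;
        }
    }
    dst[out] = '\0';
    return depth;
}
```

A non-zero return value signals an unterminated comment, which is the situation the original rules ran into when -} was swallowed as comment content.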

    You can find the specification for comments in the Haskell Report §2.3. The symbol characters that can make up an operator are defined in §2.2, which says that a symbol is:

    • Any one of these characters (ascSymbol): ! # $ % & * + . / < = > ? @ \ ^ | - ~ : or

    • Any Unicode character with the properties Symbol (S) or Punctuation (P) (uniSymbol), except for ( ) , ; [ ] ` { } (special) and _ " '.
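Putting the symbol definition together with the operator caveat, a run of dashes only starts a line comment when it is at least two dashes long and has no symbol character directly before or after it. The following is a minimal sketch of that check in plain C, assuming ASCII-only input (the uniSymbol case is left out) and ignoring string literals. Both function names are hypothetical, and the caller is assumed to pass a pos that lies within the string.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ASCII symbol characters (ascSymbol in the Haskell 2010 Report).
 * Unicode symbols are deliberately ignored in this sketch. */
static bool is_asc_symbol(char c)
{
    return c != '\0' && strchr("!#$%&*+./<=>?@\\^|-~:", c) != NULL;
}

/* Hypothetical helper: true if a "--" line comment starts at line[pos].
 * Assumes pos <= strlen(line). */
bool starts_line_comment(const char *line, size_t pos)
{
    /* A symbol right before the dashes makes them part of an operator,
     * as in |-- */
    if (pos > 0 && is_asc_symbol(line[pos - 1]))
        return false;

    /* Count the run of consecutive dashes starting at pos. */
    size_t n = 0;
    while (line[pos + n] == '-')
        n++;
    if (n < 2)
        return false;

    /* A symbol right after the dashes also forms an operator, as in --> */
    return !is_asc_symbol(line[pos + n]);
}
```

With this check, "-- note" and "--foo" count as comments at position 0, while "--> x" at position 0 and "a |-- b" at position 3 do not.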