Search code examples
haskellparsec

parsec: unexpected character parsing nested comments


I am trying to parse nested C-like block comments

import Text.ParserCombinators.Parsec
import Control.Monad (liftM)

flat :: Monad m => m [[a]] -> m [a]
flat = liftM concat

comment :: Parser String
comment = between (string "/*") (string "*/") (try nested <|> content)
  where
    content = many (try (noneOf "*/")
                   <|> try (char '*' >> notFollowedBy (char '/') >> return '*')
                   <|> try (char '/' >> notFollowedBy (char '*') >> return '/'))
    nested  = flat $ many comment

"1234567890" parses fine, but when I try

parse comment "" "/*123/*456*/789*/"

I get

Left (line 1, column 3):
unexpected "1"
expecting "/*" or "*/"

I cannot figure out why, I have try everywhere I can think of. Please help.


Solution

  • In an expression like a <|> b, if a can match the empty string then b will never be tried, and this is happening in try nested <|> content.

    You can fix your approach by requiring at least one comment match or another character:

    comment :: Parser String
    comment = between (string "/*") (string "*/") ( flat $ many $ (try comment <|> liftM toString other ) )
      where
        toString x = [x]
        other = try (noneOf "*/")
                <|> try (char '*' >> notFollowedBy (char '/') >> return '*')
                <|> try (char '/' >> notFollowedBy (char '*') >> return '/')
    

    FWIW, here is how Text.Parsec.Token does it:

    https://github.com/aslatter/parsec/blob/master/Text/Parsec/Token.hs#L698-714

    For your specific case the equivalent code is:

    import Data.List (nub)
    
    commentStart = "/*"
    commentEnd = "*/"
    
    multiLineComment =
        do { try (string commentStart)
           ; inComment
           }
    
    inComment = inCommentMulti
    
    inCommentMulti
        =   do{ try (string commentEnd) ; return () }
        <|> do{ multiLineComment                     ; inCommentMulti }
        <|> do{ skipMany1 (noneOf startEnd)          ; inCommentMulti }
        <|> do{ oneOf startEnd                       ; inCommentMulti }
        <?> "end of comment"
        where
          startEnd   = nub (commentEnd ++ commentStart)