Search code examples
stringparsinghaskellparsec

Parsec3 Text parser for quoted string, where everything is allowed in between quotes


I have actually asked this question before (here) but it turns out that the solution provided did not handle all test cases. Also, I need 'Text' parser rather than 'String', so I need parsec3.

Ok, the parser should allow for EVERY type of char inbetween quotes, even quotes. The end of the quoted text is marked by a ' character, followed by |, a space or end of input.

So,

'aa''''|

should return a string

aa'''

This is what I have:

import Text.Parsec
import Text.Parsec.Text


quotedLabel :: Parser Text
quotedLabel = do -- reads the first quote.
    spaces
    string "'"
    lab <-  liftM pack $ endBy1 anyChar endOfQuote
    return  lab

endOfQuote = do
    string "'"
    try(eof) <|> try( oneOf "| ")

Now, the problem here is of course that eof has a different type than oneOf "| ", so compilation falls.

How do I fix this? Is there a better way to achieve what I am trying to do?


Solution

  • Whitespace

    First a comment on handling white space...

    Generally the practice is to write your parsers so that they consume the whitespace following a token or syntactic unit. It's common to define combinator like:

    lexeme p = p <* spaces
    

    to easily convert a parser p to one that discards the whitespace following whatever p parses. E.g., if you have

    number = many1 digit
    

    simply use lexeme number whenever you want to eat up the whitespace following the number.

    For more on this approach to handling whitespace and other advice on parsing languages, see this Megaparsec tutorial.

    Label expressions

    Based on your previous SO question it appears you want to parse expressions of the form:

    label1 | label2 | ... | labeln
    

    where each label may be a simple label or a quoted label.

    The idiomatic way to parse this pattern is to use sepBy like this:

    labels :: Parser String
    labels = sepBy1 (try quotedLabel <|> simpleLabel) (char '|')
    

    We define both simpleLabel and quotedLabel in terms of what characters may occur in them. For simpleLabel a valid character is a non-| and non-space:

    simpleLabel :: Parser String
    simpleLabel = many (noneOf "| ")
    

    A quotedLabel is a single quote followed by a run of valid quotedLabel-characters followed by an ending single quote:

    sq = char '\''
    
    quotedLabel :: Parser String
    quotedLabel = do
      char sq
      chs <- many validChar
      char sq
      return chs
    

    A validChar is either a non-single quote or a single quote not followed by eof or a vertical bar:

    validChar = noneOf [sq] <|> try validQuote
    
    validQuote = do
      char sq
      notFollowedBy eof
      notFollowedBy (char '|')
      return sq
    

    The first notFollowedBy will fail if the single quote appears just before the end of input. The second notFollowedBy will fail if next character is a vertical bar. Therefore the sequence of the two will succeed only if there is a non-vertical bar character following the single quote. In this case the single quote should be interpreted as part of the string and not the terminating single quote.

    Unfortunately this doesn't quite work because the current implementation of notFollowedBy will always succeed with a parser which does not consume any input -- i.e. like eof. (See this issue for more details.)

    To work around this problem we can use this alternate implementation:

    notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
    notFollowedBy' p = try $ join $
          do {a <- try p; return (unexpected (show a));}
      <|> return (return ())
    

    Here is the complete solution with some tests. By adding a few lexeme calls you can make this parser eat up any white space where you decide it is not significant.

    import Text.Parsec hiding (labels)
    import Text.Parsec.String
    import Control.Monad
    
    notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
    notFollowedBy' p = try $ join $
          do {a <- try p; return (unexpected (show a));}
      <|> return (return ())
    
    sq = '\''
    
    validChar = do
      noneOf "'" <|> try validQuote
    
    validQuote = do
      char sq
      notFollowedBy' eof
      notFollowedBy (char '|')
      return sq
    
    quotedLabel :: Parser String
    quotedLabel = do
      char sq
      str <- many validChar
      char sq
      return str
    
    plainLabel :: Parser String
    plainLabel = many (noneOf "| ")
    
    labels :: Parser [String]
    labels = sepBy1 (try quotedLabel <|> try plainLabel) (char '|')
    
    test input expected = do
      case parse (labels <* eof) "" input of
        Left e -> putStrLn $ "error: " ++ show e
        Right v -> if v == expected
                     then putStrLn $ "OK - got: " ++ show v
                     else putStrLn $ "NOT OK - got: " ++ show v ++ "  expected: " ++ show expected
    
    test1 = test "a|b|c"      ["a","b","c"]
    test2 = test "a|'b b'|c"  ["a", "b b", "c"]
    test3 = test "'abc''|def" ["abc'", "def" ]
    test4 = test "'abc'"      ["abc"]
    test5 = test "x|'abc'"    ["x","abc"]