Tags: parsing, haskell, parser-combinators, attoparsec

Fast parsing of string that allows escaped characters?


I'm trying to parse a string that can contain escaped characters, here's an example:

import Control.Applicative (many, (<|>))
import Data.Attoparsec.Text
import qualified Data.Text as T

exampleParser :: Parser T.Text
exampleParser = T.pack <$> many (char '\\' *> escaped <|> anyChar)
  where escaped = satisfy (\c -> c `elem` ['\\', '"', '[', ']'])

The parser above builds a String one character at a time and then packs it into Text. Is there any way to parse a string with escapes like the above using the functions attoparsec provides for efficient string handling, such as string, scan, runScanner, takeWhile, ...?

Parsing something like "one \"two\" \[three\]" would produce one "two" [three].

Update:

Thanks to @epsilonhalbe I was able to come up with a generalized solution that fits my needs. Note that the following function doesn't look for matching escaped characters like [..], "..", (..), etc. Also, if it finds a backslash followed by a character that is not a valid escape, it treats the \ as a literal character.

takeEscapedWhile :: (Char -> Bool) -> (Char -> Bool) -> Parser Text
takeEscapedWhile isEscapable while = do
  x <- normal
  xs <- many escaped
  return $ T.concat (x:xs)
  where normal = Atto.takeWhile (\c -> c /= '\\' && while c)
        escaped = do
          x <- (char '\\' *> satisfy isEscapable) <|> char '\\'
          xs <- normal
          return $ T.cons x xs
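
As a usage sketch, the function above could be exercised as follows. The function body is repeated only so the snippet compiles on its own; `isEscapable` and `unescapeUntilQuote` are hypothetical helper names introduced for illustration, and stopping at the first unescaped `"` is just one possible choice for the second predicate.

```haskell
import Control.Applicative (many, (<|>))
import Data.Attoparsec.Text (Parser, char, parseOnly, satisfy)
import qualified Data.Attoparsec.Text as Atto
import Data.Text (Text)
import qualified Data.Text as T

-- The function from the update above, repeated so the sketch is self-contained.
takeEscapedWhile :: (Char -> Bool) -> (Char -> Bool) -> Parser Text
takeEscapedWhile isEscapable while = do
  x <- normal
  xs <- many escaped
  return $ T.concat (x:xs)
  where normal = Atto.takeWhile (\c -> c /= '\\' && while c)
        escaped = do
          x <- (char '\\' *> satisfy isEscapable) <|> char '\\'
          xs <- normal
          return $ T.cons x xs

-- Hypothetical helper: the escapable set used in the question.
isEscapable :: Char -> Bool
isEscapable c = c `elem` ['\\', '"', '[', ']']

-- Hypothetical helper: unescape the input up to the first unescaped '"'.
unescapeUntilQuote :: Text -> Either String Text
unescapeUntilQuote = parseOnly (takeEscapedWhile isEscapable (/= '"'))
```

With these definitions, `unescapeUntilQuote (T.pack "one \\\"two\\\" \\[three\\]")` yields `Right "one \"two\" [three]"`, an invalid escape such as `T.pack "a\\nb"` comes back unchanged as `Right "a\\nb"`, and `T.pack "ab\\\"c\"rest"` stops at the unescaped quote with `Right "ab\"c"`.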

Solution

  • This is possible with just attoparsec and text, and the escaping code is pretty straightforward, seeing as you have already worked with parsers:

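    {-# LANGUAGE OverloadedStrings #-}
    module MyModule where

    import Control.Applicative (many, (<|>))
    import Data.Attoparsec.Text as AT
    import qualified Data.Text as T
    import Data.Text (Text)

    normal, escaped, quoted, bracketed :: Parser Text

    -- A (possibly empty) run of characters up to the next backslash.
    normal = AT.takeWhile (/= '\\')

    -- Plain text followed by any number of "plain text, then escaped group" chunks.
    -- Note: text after the last escaped group is left unconsumed.
    escaped = do r <- normal
                 rs <- many escaped'
                 return $ T.concat $ r:rs
      where escaped' = do r1 <- normal
                          r2 <- quoted <|> bracketed
                          return $ r1 <> r2

    -- \"...\" becomes "..."
    quoted = do _ <- string "\\\""
                res <- normal
                _ <- string "\\\""
                return $ "\"" <> res <> "\""

    -- \[...\] becomes [...]
    bracketed = do _ <- string "\\["
                   res <- normal
                   _ <- string "\\]"
                   return $ "[" <> res <> "]"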
    

    Then you can use it to parse the following test cases:

    Prelude> :m + MyModule
    Prelude MyModule> import Data.Attoparsec.Text as AT
    Prelude MyModule AT> import Data.Text.IO as TIO
    Prelude MyModule AT TIO> :set -XOverloadedStrings
    Prelude MyModule AT TIO> either error TIO.putStrLn $ parseOnly escaped "test"
    test
    Prelude MyModule AT TIO> either error TIO.putStrLn $ parseOnly escaped "\\\"test\\\""
    "test"
    Prelude MyModule AT TIO> either error TIO.putStrLn $ parseOnly escaped "\\[test\\]"
    [test]
    Prelude MyModule AT TIO> either error TIO.putStrLn $ parseOnly escaped "test \\\"test\\\" \\[test\\]"
    test "test" [test]
    

    Note that you have to escape the escapes inside a Haskell string literal - that's why you see \\\" instead of \"

    Also, if you only evaluate the parse result in GHCi without printing it via putStrLn, it will show the Text value with its escapes, like

    Right "test \"test\" [test]"
    

    for the last example.

    If you parse a file, you write the simply escaped text in the file.

    test.txt

    I \[like\] \"Haskell\"
    

    Then you can run:

    Prelude MyModule AT TIO> file <- TIO.readFile "test.txt"
    Prelude MyModule AT TIO> either error TIO.putStrLn $ parseOnly escaped file
    I [like] "Haskell"
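
    Since the question also mentions scan, here is a minimal sketch of that route, not part of the solution above. It assumes the escapable set from the question and stops at the first unescaped ". The names rawUntilQuote, unescape and escapedUntilQuote are hypothetical. The raw span is taken in one pass with scan and the backslashes are removed afterwards with Text functions; whether this is actually faster than the combinator version has not been measured here.

```haskell
import qualified Data.Attoparsec.Text as AT
import Data.Text (Text)
import qualified Data.Text as T

-- Assumption: the escapable set from the question.
isEscapable :: Char -> Bool
isEscapable c = c `elem` ['\\', '"', '[', ']']

-- Take the raw input up to (not including) the first unescaped '"'.
-- The Bool state records whether the previous character was an
-- unconsumed backslash.
rawUntilQuote :: AT.Parser Text
rawUntilQuote = AT.scan False step
  where
    step :: Bool -> Char -> Maybe Bool
    step True  _    = Just False  -- character right after a backslash: always accept
    step False '\\' = Just True   -- start of a potential escape
    step False '"'  = Nothing     -- unescaped quote: stop here
    step False _    = Just False

-- Drop the backslash in front of escapable characters and keep it
-- as a literal in front of anything else.
unescape :: Text -> Text
unescape = T.concat . go
  where
    backslash = T.singleton '\\'
    go t = case T.breakOn backslash t of
      (pre, rest)
        | T.null rest -> [pre]
        | otherwise ->
            let afterSlash = T.drop 1 rest
            in case T.uncons afterSlash of
                 Just (c, more) | isEscapable c -> pre : T.singleton c : go more
                 _ -> pre : backslash : go afterSlash

-- Combined parser: scan the raw span, then unescape it.
escapedUntilQuote :: AT.Parser Text
escapedUntilQuote = unescape <$> rawUntilQuote
```

    For example, AT.parseOnly escapedUntilQuote (T.pack "one \\\"two\\\" \\[three\\]") gives Right "one \"two\" [three]". The design choice is to keep the parser itself to a single scan call and to do the unescaping as a pure Text-to-Text step.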