
Matching bytestrings in Parsec


I am currently trying to use the Full CSV Parser presented in Real World Haskell. I tried to modify the code to use ByteString instead of String, but the string combinator only works with String.

Is there a Parsec combinator similar to string that works with ByteString, without having to do conversions back and forth?

I've seen there is an alternative parser that handles ByteString: attoparsec, but I would prefer to stick with Parsec, since I'm just learning how to use it.


Solution

  • I'm assuming you're starting with something like

    import Prelude hiding (getContents, putStrLn)
    import Data.ByteString
    import Text.Parsec.ByteString
    

    Here's what I've got so far. There are two versions. Both compile. Probably neither is exactly what you want, but they should aid discussion and help you to clarify your question.

    Something I noticed along the way:

    • If you import Text.Parsec.ByteString then this uses uncons from Data.ByteString.Char8, which in turn uses w2c from Data.ByteString.Internal, to convert all read bytes to Chars. This enables Parsec's line and column number error reporting to work sensibly, and also enables you to use string and friends without problem.
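
    As a minimal sketch of that point, assuming a reasonably recent parsec and bytestring, the stock string combinator can be run directly over a ByteString input. The parser name header and the sample input are made up for illustration.

    ```haskell
    import qualified Data.ByteString.Char8 as Char8 (pack)
    import Text.Parsec (parse, string)
    import Text.Parsec.ByteString (Parser)

    -- Hypothetical example parser: matches the literal "name".
    -- Parser here is Parsec over a strict ByteString with Char tokens,
    -- so the ordinary String-based combinator type-checks unchanged.
    header :: Parser String
    header = string "name"

    main :: IO ()
    main = print (parse header "(example)" (Char8.pack "name,age\n"))
    -- prints: Right "name"
    ```

    Note that the result is still a String; only the input is a ByteString.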

    Thus, the easy version of the CSV parser, which does exactly that:

    import Prelude hiding (getContents, putStrLn)
    import Data.ByteString (ByteString)
    
    import qualified Prelude (getContents, putStrLn)
    import qualified Data.ByteString as ByteString (getContents)
    
    import Text.Parsec
    import Text.Parsec.ByteString
    
    csvFile :: Parser [[String]]
    csvFile = endBy line eol
    line :: Parser [String]
    line = sepBy cell (char ',')
    cell :: Parser String
    cell = quotedCell <|> many (noneOf ",\n\r")
    
    quotedCell :: Parser String
    quotedCell = 
        do _ <- char '"'
           content <- many quotedChar
           _ <- char '"' <?> "quote at end of cell"
           return content
    
    quotedChar :: Parser Char
    quotedChar =
            noneOf "\""
        <|> try (string "\"\"" >> return '"')
    
    eol :: Parser String
    eol =   try (string "\n\r")
        <|> try (string "\r\n")
        <|> string "\n"
        <|> string "\r"
        <?> "end of line"
    
    parseCSV :: ByteString -> Either ParseError [[String]]
    parseCSV = parse csvFile "(unknown)"
    
    main :: IO ()
    main =
        do c <- ByteString.getContents
           case parse csvFile "(stdin)" c of
                Left e -> do Prelude.putStrLn "Error parsing input:"
                             print e
                Right r -> mapM_ print r
    

    But this was so trivial to get working that I assume it cannot possibly be what you want. Perhaps you want everything to remain a ByteString or [Word8] or something similar all the way through? Hence my second attempt below. I am still importing Text.Parsec.ByteString, which may be a mistake, and the code is hopelessly riddled with conversions.

    But, it compiles and has complete type annotations, and therefore should make a sound starting point.

    import Prelude hiding (getContents, putStrLn)
    import Data.ByteString (ByteString)
    import Control.Monad (liftM)
    
    import qualified Prelude (getContents, putStrLn)
    import qualified Data.ByteString as ByteString (pack, getContents)
    import qualified Data.ByteString.Char8 as Char8 (pack)
    
    import Data.Word (Word8)
    import Data.ByteString.Internal (c2w)
    
    import Text.Parsec ((<|>), (<?>), parse, try, endBy, sepBy, many)
    import Text.Parsec.ByteString
    import Text.Parsec.Prim (tokens, tokenPrim)
    import Text.Parsec.Pos (updatePosChar, updatePosString)
    import Text.Parsec.Error (ParseError)
    
    csvFile :: Parser [[ByteString]]
    csvFile = endBy line eol
    line :: Parser [ByteString]
    line = sepBy cell (char ',')
    cell :: Parser ByteString
    cell = quotedCell <|> liftM ByteString.pack (many (noneOf ",\n\r"))
    
    quotedCell :: Parser ByteString
    quotedCell = 
        do _ <- char '"'
           content <- many quotedChar
           _ <- char '"' <?> "quote at end of cell"
           return (ByteString.pack content)
    
    quotedChar :: Parser Word8
    quotedChar =
            noneOf "\""
        <|> try (string "\"\"" >> return (c2w '"'))
    
    eol :: Parser ByteString
    eol =   try (string "\n\r")
        <|> try (string "\r\n")
        <|> string "\n"
        <|> string "\r"
        <?> "end of line"
    
    parseCSV :: ByteString -> Either ParseError [[ByteString]]
    parseCSV = parse csvFile "(unknown)"
    
    main :: IO ()
    main =
        do c <- ByteString.getContents
           case parse csvFile "(stdin)" c of
                Left e -> do Prelude.putStrLn "Error parsing input:"
                             print e
                Right r -> mapM_ print r
    
    -- replacements for some of the functions in the Parsec library
    
    noneOf :: String -> Parser Word8
    noneOf cs   = satisfy (\b -> b `notElem` [c2w c | c <- cs])
    
    char :: Char -> Parser Word8
    char c      = byte (c2w c)
    
    byte :: Word8 -> Parser Word8
    byte c      = satisfy (==c)  <?> show [c]
    
    satisfy :: (Word8 -> Bool) -> Parser Word8
    satisfy f   = tokenPrim (\c -> show [c])
                            (\pos c _cs -> updatePosChar pos c)
                            (\c -> if f (c2w c) then Just (c2w c) else Nothing)
    
    string :: String -> Parser ByteString
    string s    = liftM Char8.pack (tokens show updatePosString s)
    

    Efficiency-wise, your main concern should probably be the two ByteString.pack calls in the definitions of cell and quotedCell. You might try to replace the Text.Parsec.ByteString module so that, instead of “making strict ByteStrings an instance of Stream with Char token type,” you make ByteStrings an instance of Stream with Word8 token type. This won't help with efficiency, though; it will just give you a headache reimplementing all the SourcePos functions that keep track of your position in the input for error messages.
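
    To give a feel for what the Word8-token route involves, here is a rough sketch, not a recommendation. The names Bytes, ByteParser, updatePosByte and satisfyByte are hypothetical; the newtype is only there so the instance does not clash with the Char-token instance that Parsec already provides for ByteString.

    ```haskell
    {-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}
    import qualified Data.ByteString as B
    import Data.Word (Word8)
    import Text.Parsec.Prim (Parsec, Stream (..), many, parse, tokenPrim)
    import Text.Parsec.Pos (SourcePos, incSourceColumn, incSourceLine, setSourceColumn)

    -- Hypothetical wrapper around a strict ByteString.
    newtype Bytes = Bytes B.ByteString

    -- Tokens are raw bytes; no conversion to Char takes place.
    instance Monad m => Stream Bytes m Word8 where
      uncons (Bytes b) = return (fmap (\(w, rest) -> (w, Bytes rest)) (B.uncons b))

    type ByteParser = Parsec Bytes ()

    -- Position tracking has to be written by hand for bytes.
    -- Assumption: byte 10 ('\n') starts a new line, anything else is one column.
    updatePosByte :: SourcePos -> Word8 -> SourcePos
    updatePosByte pos 10 = setSourceColumn (incSourceLine pos 1) 1
    updatePosByte pos _  = incSourceColumn pos 1

    -- Byte-level counterpart of Parsec's satisfy.
    satisfyByte :: (Word8 -> Bool) -> ByteParser Word8
    satisfyByte f = tokenPrim (\w -> show [w])
                              (\pos w _rest -> updatePosByte pos w)
                              (\w -> if f w then Just w else Nothing)

    main :: IO ()
    main = print (parse (many (satisfyByte (/= 44))) "(example)"
                        (Bytes (B.pack [97, 98, 44, 99])))
    -- prints: Right [97,98]
    ```

    Every other combinator (char, string, noneOf and so on) would need a similar byte-level replacement, which is the headache mentioned above.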

    The way to make it more efficient would instead be to change the types of cell, quotedCell and string to Parser [Word8], and the types of line and csvFile to Parser [[Word8]] and Parser [[[Word8]]] respectively. You could even change the type of eol to Parser (). The necessary changes would look something like this:

    cell :: Parser [Word8]
    cell = quotedCell <|> many (noneOf ",\n\r")
    
    quotedCell :: Parser [Word8]
    quotedCell = 
        do _ <- char '"'
           content <- many quotedChar
           _ <- char '"' <?> "quote at end of cell"
           return content
    
    string :: String -> Parser [Word8]
    string s    = liftM (map c2w) (tokens show updatePosString s)
    

    You don't need to worry about all the calls to c2w as far as efficiency is concerned, because they cost nothing.
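
    The Parser () variant of eol mentioned earlier is not shown in the snippets, so here is one possible sketch. To keep it self-contained it assumes the stock string combinator from Text.Parsec rather than the Word8 replacement, and simply discards the matched text with void.

    ```haskell
    import Control.Monad (void)
    import qualified Data.ByteString.Char8 as Char8 (pack)
    import Text.Parsec ((<|>), (<?>), parse, string, try)
    import Text.Parsec.ByteString (Parser)

    -- Accepts any of the four line endings and throws the match away,
    -- so nothing needs to be packed or converted for the result.
    eol :: Parser ()
    eol = void (   try (string "\n\r")
               <|> try (string "\r\n")
               <|> string "\n"
               <|> string "\r")
          <?> "end of line"

    main :: IO ()
    main = print (parse eol "(example)" (Char8.pack "\r\nrest"))
    -- prints: Right ()
    ```

    Since endBy ignores the separator's result, csvFile = endBy line eol works unchanged with this type.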

    If this doesn't answer your question, please say what would.