Search code examples
haskellconduitattoparsec

Conduit and Attoparsec: unexpected termination on parse error


I'm trying to convert a log file parser that I wrote a while back over to conduit, and I'm running into an issue. I'll simplify the details of the parser itself since that's not relevant to the question. I have a log file that looks like this:

200 GET
404 POST
500 GET
FOO
301 PUT
302 GET
201 POST

So the parsing code is pretty straightforward:

data SimpleLogEntry = SimpleLogEntry {
      status :: Int
    , method :: String
} deriving (Show, Eq)


parseHTTPStatus :: Parser Int
parseHTTPStatus = validate <$> decimal
    where validate d = if (d >= 200 && d < 999) then d else 100


parseHTTPMethod :: Parser String
parseHTTPMethod =
        (stringCI "GET" *> return "Get")
    <|> (stringCI "POST" *> return "Post")
    <|> (stringCI "PUT" *> return "Put")
    <|> return "Unknown"


parseLogLine :: Parser SimpleLogEntry
parseLogLine = fmap SimpleLogEntry
        parseHTTPStatus
    <*> (space *> parseHTTPMethod)

So far so good. Here's how I implemented this in conduit:

import Prelude hiding (lines)

import Control.Applicative
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Resource (runResourceT, ResourceT)
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import Data.Conduit
import qualified Data.Conduit.Attoparsec as CA
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL


logLines:: Source (ResourceT IO) B.ByteString
logLines = CB.sourceFile "~/test.log" $= CB.lines


parseEntry :: ConduitM B8.ByteString SimpleLogEntry (ResourceT IO) ()
parseEntry = CA.conduitParserEither parseLogLine =$= awaitForever go
    where
        go (Left err) = liftIO $ putStrLn ("Got an error: " ++ CA.errorMessage err)
        go (Right (_, logEntry)) = yield logEntry


sink :: Sink SimpleLogEntry (ResourceT IO) ()
sink = CL.mapM_ (\t -> liftIO $ putStrLn $ "Got a status: " ++ (show . status) t)


main :: IO ()
main = runResourceT $ logLines $= parseEntry $$ sink

When running main I get this output:

Got a status: 200
Got a status: 404
Got a status: 500
Got an error: Failed reading: takeWhile1

I'm having trouble understanding why the pipeline terminates at this point, rather that continuing to parse the next line of the file, as I'd like to do. Reading the docs for Data.Conduit.Attoparsec, this seems like exactly the use case conduitParserEither was designed for.

UPDATE

Per @Fabian, it turns out that conduitParserEither wasn't really what I wanted here. Here's a definition of parseEntry that does what I wanted to do:

parseEntry' :: ConduitM B8.ByteString SimpleLogEntry (ResourceT IO) ()
parseEntry' = (CL.map (parseOnly parseLogLine)) =$= awaitForever go
    where
        go (Left err) = liftIO $ putStrLn ("Got an error: " ++ err)
        go (Right logEntry) = yield logEntry

Solution

  • The conduitParser (or conduitParserEither) can also consume multiple tokens on one line: for example the following input produces the same result:

    200 GET404 POST
    500 GET
    FOO
    301 PUT
    302 GET
    201 POST
    

    So it makes sense that the parser doesn't continue because it doesn't know where the next token would begin.