I'm trying to parse a multiline log like this
[xxx] This is 1
[xxx] This is also 1
[yyy] This is 2
I have these types defined
{-# LANGUAGE OverloadedStrings #-}
module Parser where
import Prelude hiding(takeWhile)
import Data.Text
import Data.Word
import Data.Attoparsec.Text as T
import Data.Char
import Data.String
data ID = ID String deriving (Eq, Show)
data Entry = Entry ID String deriving (Eq, Show)
data Block = Block ID [String]
data Log = Log [Block]
And defined these parsers:
parseID :: Parser ID
parseID = do
char '['
id <- takeTill ( == ']' )
char ']'
return $ ID $ unpack id
parseEntry :: Parser Entry
parseEntry = do
id <- parseID
char ' '
content <- takeTill isEndOfLine
return $ Entry id (unpack content)
This works ok when I do stuff like parseOnly parseEntry entryString
and I get back an Entry
.
The problem is when I try to parse something like the log I added at the start.
I would get a [Entry]
but I would like to get [Block]
.
Also I want that when 2 or more consecutive lines have the same ID (like xxx
) the should get stored into the same block, so for parsing the aforementioned log I'd like to get back
[block1, block2]
-- block1 == Block "xxx" ["This is 1", "This is also 1"]
-- block2 == Block "yyy" ["This is 2"]
How can I make the parser create new blocks or add into the last generated one depending on if the ID
changes?
One obvious solution is to simply generate a [Entry]
and then use a folding function to convert it to [Block]
with the proper logic, but I'd be doing 2 passes, 1 over the log and another over the [Entry]
which seems not only is not too performant for large logs but also feels like the wrong way to do it (from my very limited attoparsec knowledge)
Any other ideas?
Bob Dalgleish solution essentially works (many thanks!!!), just needed a few tweaks to make it work. This is my final solution:
data ID = ID String deriving (Eq, Show)
data Entry = Entry ID String deriving (Eq, Show)
data Block = Block ID [String] deriving (Eq, Show)
data Log = Log [Block] deriving (Eq, Show)
parseID :: Parser ID
parseID = do
char '['
id <- takeTill ( == ']' )
char ']'
return $ ID $ unpack id
parseEntry :: Parser Entry
parseEntry = do
id <- parseID
char ' '
content <- takeTill isEndOfLine
return $ Entry id (unpack content)
parseEntryFor :: ID -> Parser Entry
parseEntryFor blockId = do
id <- parseID
if blockId == id
then do
char ' '
content <- takeTill isEndOfLine
endOfLine <|> endOfInput
return $ Entry id (unpack content)
else fail "nonmatching id"
parseBlock :: Parser Block
parseBlock = do
(Entry entryId s) <- parseEntry
let newBlock = Block entryId [s]
endOfLine <|> endOfInput
entries <- many' (parseEntryFor entryId)
return $ Block entryId (s : Prelude.map (\(Entry _ s') -> s') entries)
You need to have a parser for Block
s. It accepts an Entry
, does a lookahead for an Entry
with the same id; if not the same, it backtracks and returns what it has so far.
First, let's introduce a conditional Entry
parser:
parseEntryFor :: ID -> Parser Entry
parseEntryFor blockId = do
id <- parseEntry
if blockId == id
then do
char ' '
content <- takeTill isEndOfLine
endOfLine
return $ Entry id (unpack content)
else fail "nonmatching id"
-- |A Block consists of one or more Entry's with the same ID
parseBlock :: Parser Block
parseBlock = do
(Entry entryId s) <- parseEntry
let newBlock = Block entryId [s]
endOfLine
entries <- many' (parseEntryFor entryId)
return $ Block entryId s: (map (\(Entry _ s') -> x') entries)
(This code is not tested, as I have only ever used Parsec.)