Search code examples
parsinghaskellloggingattoparsec

Parsing multi line log with attoparsec


I'm trying to parse a multiline log like this

[xxx] This is 1
[xxx] This is also 1
[yyy] This is 2

I have these types defined

{-# LANGUAGE OverloadedStrings #-}

module Parser where

import Prelude hiding(takeWhile)
import Data.Text
import Data.Word
import Data.Attoparsec.Text as T
import Data.Char
import Data.String

data ID    = ID String deriving (Eq, Show)
data Entry = Entry ID String deriving (Eq, Show)
data Block = Block ID [String]
data Log   = Log [Block]

And defined these parsers:

parseID :: Parser ID
parseID = do
  char '['
  id <- takeTill ( == ']' )
  char ']'
  return $ ID $ unpack id

parseEntry :: Parser Entry
parseEntry = do
  id <- parseID
  char ' '
  content <- takeTill isEndOfLine
  return $ Entry id (unpack content)

This works ok when I do stuff like parseOnly parseEntry entryString and I get back an Entry.

The problem is when I try to parse something like the log I added at the start. I would get a [Entry] but I would like to get [Block].

Also I want that when 2 or more consecutive lines have the same ID (like xxx) the should get stored into the same block, so for parsing the aforementioned log I'd like to get back

[block1, block2]
-- block1 == Block "xxx" ["This is 1", "This is also 1"]
-- block2 == Block "yyy" ["This is 2"]

How can I make the parser create new blocks or add into the last generated one depending on if the ID changes?

One obvious solution is to simply generate a [Entry] and then use a folding function to convert it to [Block] with the proper logic, but I'd be doing 2 passes, 1 over the log and another over the [Entry] which seems not only is not too performant for large logs but also feels like the wrong way to do it (from my very limited attoparsec knowledge)

Any other ideas?

EDIT

Bob Dalgleish solution essentially works (many thanks!!!), just needed a few tweaks to make it work. This is my final solution:

data ID    = ID String deriving (Eq, Show)
data Entry = Entry ID String deriving (Eq, Show)
data Block = Block ID [String] deriving (Eq, Show)
data Log   = Log [Block] deriving (Eq, Show)

parseID :: Parser ID
parseID = do
  char '['
  id <- takeTill ( == ']' )
  char ']'
  return $ ID $ unpack id

parseEntry :: Parser Entry
parseEntry = do
  id <- parseID
  char ' '
  content <- takeTill isEndOfLine
  return $ Entry id (unpack content)

parseEntryFor :: ID -> Parser Entry
parseEntryFor blockId = do
  id <- parseID
  if blockId == id
     then do
       char ' '
       content <- takeTill isEndOfLine
       endOfLine <|> endOfInput
       return $ Entry id (unpack content)
  else fail "nonmatching id"

parseBlock :: Parser Block
parseBlock = do
  (Entry entryId s) <- parseEntry
  let newBlock = Block entryId [s]
  endOfLine <|> endOfInput
  entries <- many' (parseEntryFor entryId)
  return $ Block entryId (s : Prelude.map (\(Entry _ s') -> s') entries)

Solution

  • You need to have a parser for Blocks. It accepts an Entry, does a lookahead for an Entry with the same id; if not the same, it backtracks and returns what it has so far.

    First, let's introduce a conditional Entry parser:

    parseEntryFor :: ID -> Parser Entry
    parseEntryFor blockId = do
      id <- parseEntry
      if blockId == id
      then do
             char ' '
             content <- takeTill isEndOfLine
             endOfLine
             return $ Entry id (unpack content)
      else fail "nonmatching id"
    
    -- |A Block consists of one or more Entry's with the same ID
    parseBlock :: Parser Block
    parseBlock = do
      (Entry entryId s) <- parseEntry
      let newBlock = Block entryId [s]
      endOfLine
      entries <- many' (parseEntryFor entryId)
      return $ Block entryId s: (map (\(Entry _ s') -> x') entries)
    

    (This code is not tested, as I have only ever used Parsec.)