Search code examples
haskellparsec

How to parse multiple lines with start and end tokens in Parsec


I am new at Parsec. Will appreciate pointers for the problem here. Say, I have a csv file with fixed number of headers. Instead of parsing each line separately, I will like to look for a token at the beginning of the line, and get all the lines until next line with non-empty token. Example below:

token,flag,values
a,1,
,,a
,,f
b,2,

Rule for a valid input is: if the token field is filled in, get all the lines until next non-empty token field. So, I will like Parsec to get multiple lines below as first input (those multiple lines can then be parsed by another rule):

a,1,
,,a
,,f

Then, the process starts again on the next line with non-empty token field (last line in the example here). What I am trying to figure out is if there a simple way to specify the rule like that in Parsec - get all the lines that meet a certain rule. They could then be handed off to another parser. Basically, it looks like some kind of lookahead rule to specify what is a valid multi-line input. Did I get it right?

We can ignore the comma separator above for now, and just say that an input begins when a character is found at the beginning of a line, and ends when a character is found at the beginning of a line.


Solution

  • I solved the problem with the help of @user2407038 who suggested the basic outline in a comment. Solution and explanation below (please see the comments after the function - they show how the function behaves with the input):

    {-# LANGUAGE FlexibleContexts #-}
    import Control.Monad
    import Text.Parsec
    import Control.Applicative hiding ((<|>), many)
    
    
    -- | this one accepts everything until newline, and discards the newline
    -- | This one is used as building block in the functions below
    restOfLine :: Stream s m Char => ParsecT s u m [Char]
    restOfLine = many1 (satisfy (\x -> not $ x == '\n')) <* char '\n'
    
    -- | a line with token is "many alphanumeric characters" followed by 
    -- | any characters until newline 
    tokenLine :: Stream s m Char => ParsecT s u m [Char]
    tokenLine =  (++) <$>  many1 alphaNum <*> restOfLine
    
    -- | ghci test:
    -- | *Main Text.Parsec> parseTest tokenLine "a,1,,\n"
    -- | "a,1,,"
    -- | *Main Text.Parsec> parseTest tokenLine ",1,,\n"
    -- | parse error at (line 1, column 1):
    -- | unexpected ","
    -- |expecting letter or digit
    
    -- | a non-token line is a line that has any number of spaces followed
    -- | by ",", then any characters until newline
    nonTokenLine :: Stream s m Char => ParsecT s u m [Char]
    nonTokenLine = (++) <$> (many space) <*> ((:) <$> char ',' <*> restOfLine)
    
    -- | ghci test:
    -- | *Main Text.Parsec> parseTest nonTokenLine ",1,,\n"
    -- | ",1,,"
    -- | *Main Text.Parsec> parseTest nonTokenLine "a,1,,\n"
    -- | parse error at (line 1, column 1):
    -- | unexpected "a"
    -- | expecting space or ","
    
    -- | One entry is tokenLine followed by any number of nonTokenLine
    oneEntry :: Stream s m Char => ParsecT s u m [[Char]]
    oneEntry = (:) <$> tokenLine <*> (many nonTokenLine)
    
    -- | ghci test - please note that it drops last line as expected
    -- | *Main Text.Parsec> parseTest oneEntry "a,1,,\n,,a\n,,f\nb,2,,\n"
    -- | ["a,1,,",",,a",",,f"]
    
    
    -- | We add 'many' to oneEntry to parse the entire file, and get multiple match entries
    multiEntries :: Stream s m Char => ParsecT s u m [[String]]
    multiEntries = many oneEntry
    
    -- | ghci test - please note that it gets two entries as expected
    -- | *Main Text.Parsec> parseTest multiEntries "a,1,,\n,,a\n,,f\nb,2,,\n"
    -- | [["a,1,,",",,a",",,f"],["b,2,,"]]
    

    The parser error seen in the comments is expected on invalid inputs. This can be easily handled. The above code is just a basic building block to get started.