Escaping end of line with Parsec


This time I'm trying to parse a text file into [[String]] using Parsec. The result is a list of lists, one for each line of the file. Every line is a list of words, which may be separated by any number of spaces, (optionally) commas, and spaces after commas as well.

Here is my code and it even works.

import Text.ParserCombinators.Parsec hiding (spaces)
import Control.Applicative ((<$>))
import System.IO
import System.Environment

myParser :: Parser [[String]]
myParser =
    do x <- sepBy parseColl eol
       eof
       return x

eol :: Parser String
eol = try (string "\n\r")
  <|> try (string "\r\n")
  <|> string "\n"
  <|> string "\r"
  <?> "end of line"

spaces :: Parser ()
spaces = skipMany (char ' ') >> return ()

parseColl :: Parser [String]
parseColl = many parseItem

parseItem :: Parser String
parseItem =
    do optional spaces
       x <- many1 (noneOf " ,\n\r")
       optional spaces
       optional (char ',')
       return x

parseText :: String -> String
parseText str =
    case parse myParser "" str of
      Left e  -> "parser error: " ++ show e
      Right x -> show x

main :: IO ()
main =
    do fileName <- head <$> getArgs
       handle <- openFile fileName ReadMode
       contents <- hGetContents handle
       putStr $ parseText contents
       hClose handle

Test file:

this is my test file
this, line, is, separated, by, commas
and this is another, line

Result:

[["this","is","my","test","file"],
 ["this","line","is","separated","by","commas"],
 ["and","this","is","another","line"],
 []] -- well, this is a bit unexpected, but I can filter things

Now, to make my life harder, I wish to be able to 'escape' an eol if there is a comma before it, even if the comma is followed by spaces. So this should be considered one line:

this is, spaces may be here
my line

What is the best strategy (most idiomatic and elegant) to implement this syntax without losing the ability to ignore commas inside a line?


Solution

  • A couple of solutions come to mind. One is easy, the other is of medium difficulty.


    The medium-difficulty solution is to define an itemSeparator as a comma followed by whitespace, and a lineSeparator as a '\n' or '\r' followed by whitespace. At the end of the item parser, skip whitespace other than '\n' and '\r', but nothing more, so that the very next char after an item must be a '\n', a '\r', or a ','. That char then determines, without backtracking, whether a new item or a new line is coming.

    Then use sepBy1 to define parseLine (i.e. parseLine = parseItem `sepBy1` parseItemSeparator), and endBy to define parseFile (i.e. parseFile = parseLine `endBy` parseLineSeparator).

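    You really do need sepBy1 on the inside rather than sepBy; otherwise a line may contain zero items, and a parser that can succeed without consuming any input inside a repetition risks looping forever at parse time (Parsec's many reports an error for such a parser). endBy works like sepBy, except that the separator must also follow the last line, so a '\n' or '\r' at the end of the file is consumed rather than starting a new, empty line.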
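
    A minimal sketch of that layout follows; it is an illustration under a few assumptions, not a tested drop-in replacement for the code in the question. The helper hspace and the wrapper parseLines are hypothetical names, tabs are treated like spaces, the comma is made optional so that words separated only by spaces still work, and sepEndBy is swapped in for endBy so that a file without a final line break is accepted too. It uses spaces from Text.Parsec (which also skips line breaks), not the spaces defined in the question.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical helper: skip spaces and tabs only; line breaks are left
-- for the separators to decide on.
hspace :: Parser ()
hspace = skipMany (oneOf " \t")

-- One word, followed by optional horizontal whitespace.
parseItem :: Parser String
parseItem = many1 (noneOf " \t,\n\r") <* hspace

-- An optional comma. When the comma is present, all whitespace after it is
-- skipped, line breaks included; this is what "escapes" the end of line.
parseItemSeparator :: Parser ()
parseItemSeparator = optional (char ',' *> spaces)

-- A line break plus any whitespace after it, so blank lines are skipped.
parseLineSeparator :: Parser ()
parseLineSeparator = oneOf "\n\r" *> spaces

-- At least one item per line; parseItem always consumes input on success.
parseLine :: Parser [String]
parseLine = parseItem `sepBy1` parseItemSeparator

-- Assumption: sepEndBy instead of endBy, so the last line break is optional.
parseFile :: Parser [[String]]
parseFile = spaces *> (parseLine `sepEndBy` parseLineSeparator) <* eof

-- Hypothetical convenience wrapper for trying the parser on a String.
parseLines :: String -> Either ParseError [[String]]
parseLines = parse parseFile ""
```

    With these definitions, parseLines "first, \nsecond third\nfourth\n" should give Right [["first","second","third"],["fourth"]]. One consequence of this design is that a comma with nothing after it (for example at the very end of the input) is a parse error, because the separator has already committed to another item.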


    An easier way would be to canonicalize the input by running it through a simple transformation before parsing. You can write a function that removes whitespace after a comma (using dropWhile and isSpace), and perhaps also folds the different cases of '\n' and '\r' into one, then run the output through a simplified parser.

    Something like this would do the trick (untested):

    import Data.Char (isSpace)

    canonicalize :: String -> String
    canonicalize [] = []
    canonicalize (',':rest)  = ',' : canonicalize (dropWhile isSpace rest)   -- whitespace after a comma is dropped, line breaks included
    canonicalize ('\n':rest) = '\n' : canonicalize (dropWhile isSpace rest)
    canonicalize ('\r':rest) = '\n' : canonicalize (dropWhile isSpace rest)  -- every '\r' becomes '\n'
    canonicalize (c:rest)    = c : canonicalize rest
    

    Because Haskell is lazy, this transformation works on streaming data as it arrives, so it should not slow things down noticeably. Depending on how much it lets you simplify the parser, it could even speed things up, although most likely it will be close to a wash.

    I don't know how complicated the full problem is, but perhaps a few more rules in the canonicalization function will in fact allow you to use the standard lines and words functions after all.
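
    As a rough illustration of that last idea, here is a sketch that needs no Parsec at all. It assumes that commas carry no meaning beyond separating words, so one extra rule can turn every remaining comma into a space; commaToSpace and splitLines are hypothetical names, and canonicalize is repeated only so that the snippet compiles on its own.

```haskell
import Data.Char (isSpace)

-- Same idea as the canonicalize function above, repeated here so that the
-- snippet is self-contained.
canonicalize :: String -> String
canonicalize [] = []
canonicalize (',':rest)  = ',' : canonicalize (dropWhile isSpace rest)
canonicalize ('\n':rest) = '\n' : canonicalize (dropWhile isSpace rest)
canonicalize ('\r':rest) = '\n' : canonicalize (dropWhile isSpace rest)
canonicalize (c:rest)    = c : canonicalize rest

-- Hypothetical extra rule: once the line breaks after commas are gone,
-- every remaining comma is just a word separator.
commaToSpace :: Char -> Char
commaToSpace ',' = ' '
commaToSpace c   = c

-- Hypothetical name: canonicalize, then split into lines and words.
splitLines :: String -> [[String]]
splitLines = map words . lines . map commaToSpace . canonicalize
```

    For the input "this is, \nmy line\nthis, line, is\n", splitLines should return [["this","is","my","line"],["this","line","is"]]. The price of this shortcut is that there are no parse errors: malformed input is silently split on whitespace instead of being reported.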