This time I'm trying to parse a text file into [[String]]
using Parsec. Result is a list consisting of lists that represent lines of the file. Every line is a list that contains words which may be separated by any number of spaces, (optionally) commas, and spaces after commas as well.
Here is my code and it even works.
import Text.ParserCombinators.Parsec hiding (spaces)
import Control.Applicative ((<$>))
import System.IO
import System.Environment
myParser :: Parser [[String]]
myParser =
do x <- sepBy parseColl eol
eof
return x
eol :: Parser String
eol = try (string "\n\r")
<|> try (string "\r\n")
<|> string "\n"
<|> string "\r"
<?> "end of line"
spaces :: Parser ()
spaces = skipMany (char ' ') >> return ()
parseColl :: Parser [String]
parseColl = many parseItem
parseItem :: Parser String
parseItem =
do optional spaces
x <- many1 (noneOf " ,\n\r")
optional spaces
optional (char ',')
return x
parseText :: String -> String
parseText str =
case parse myParser "" str of
Left e -> "parser error: " ++ show e
Right x -> show x
main :: IO ()
main =
do fileName <- head <$> getArgs
handle <- openFile fileName ReadMode
contents <- hGetContents handle
putStr $ parseText contents
hClose handle
Test file:
this is my test file
this, line, is, separated, by, commas
and this is another, line
Result:
[["this","is","my","test","file"],
["this","line","is","separated","by","commas"],
["and","this","is","another","line"],
[]] -- well, this is a bit unexpected, but I can filter things
Now, to make my life harder, I wish to be able to 'escape' eol
if there is a comma ,
before it, even if the comma is followed by spaces. So this is should be considered one line:
this is, spaces may be here
my line
What is best strategy (most idiomatic and elegant) to implement this syntax (without losing the ability to ignore commas inside a line).
A couple of solutions come to mind.... One is easy, the other is medium difficulty.
The medium-difficulty solution is to define an itemSeparator to be a comma followed by whitespace, and a lineSeparator to be a '\n' or '\r' followed by whitespace.... Make sure to skip non '\n', '\r'-whitespace, but no further, at the end of the item parse, so that the very next char after an item must be either a '\n', '\r', or ',', which determines, without backtracking, whether a new item or line is coming.
Then use sepBy1
to define parseLine
(ie- parseLine = parseItem sepBy1
parseItemSeparator), and endBy to define parseFile (ie- parseFile = parseLine endBy
parseLineSeparator).
You really do need that sepBy1
on the inside, vs sepBy
, else you will have a list of zero sized items, which causes an infinite loop at parse time. endBy
works like sepBy
, but allows extra '\n', '\r' at the end of the file....
An easier way would be to canonicalize the input by running it though a simple transformation before parsing. You can write a function to remove whitespace after a comma (using dropWhile
and isSpace
), and perhaps even simplify the different cases of '\n', '\r'.... then run the output through a simplified parser.
Something like this would do the trick (this is untested....)
canonicalize::String->String
canonicalize [] == []
canonicalize (',':rest) = ',':canonicalize (dropWhile isSpace rest)
canonicalize ('\n':rest) = '\n':canonicalize (dropWhile isSpace rest)
canonicalize ('\r':rest) = '\n':canonicalize (dropWhile isSpace rest) --all '\r' will become '\n'
canonicalize (c:rest) = c:canonicalize rest
Because Haskell is lazy, this transformation will work on streaming data as the data comes in, so this really won't slow anything down at all (depending on how much you simplify the parser, it could even speed things up.... Although most likely it will be close to a wash)
I don't know how complicated the full question is, but perhaps a few rules added to a canonicalization function will in fact allow you to use lines
and words
after all....