I have defined the following Parsec parser for parsing CSV files into a table of strings, i.e. [[String]]:
-- A csv parser is some rows separated, and possibly ended, by a newline character
csvParser = sepEndBy row (char '\n')
-- A row is some cells separated by a comma character
row = sepBy cell (char ',')
-- A cell is either a quoted cell or a normal cell
cell = qcell <|> ncell
-- A normal cell is a series of characters which are neither , nor newline. Any of them may also be an escaped character
ncell = many (escChar <|> noneOf ",\n")
-- A quoted cell is a " followed by some characters, which are either escaped characters or normal characters other than ", followed by a closing "
qcell = do
    char '"'
    res <- many (escChar <|> noneOf "\"")
    char '"'
    return res
-- An escaped character is any character preceded by a \. The \ is discarded.
escChar = char '\\' >> anyChar
I don't really know if the comments are too much and annoying, or if they are helping. As a Parsec noob they would help me, so I thought I would add them.
It works pretty well, but there is a problem: it creates an extra, empty row in the table. For example, if I have a csv file with 10 rows (that is, only 10 lines, with no empty lines at the end*), the [[String]] structure will have length 11, and the last list of Strings will contain 1 element: an empty String (at least this is how it appears when printing it using show).
My main question is: Why does this extra row appear, and what can I do to stop it?
Another thing I have noticed is that if there are empty lines after the data in the csv file, these end up as rows containing only an empty String in the table. I thought that using sepEndBy instead of sepBy would make the extra empty lines be ignored. Is this not the case?
*After looking at the text file in a hex editor, it seems that it does in fact end in a newline character, even though vim doesn't show it...
If you want each row to have at least one cell, you can use sepBy1 instead of sepBy. This should also stop empty rows being parsed as a row, provided cell itself cannot match the empty string. The difference between sepBy and sepBy1 is the same as the difference between many and many1: the 1 version only parses sequences of at least one element. So row becomes this:
row = sepBy1 cell (char ',')
Also, the usual style is to use sepBy1 in infix form: cell `sepBy1` char ','. This reads more naturally: you have a "cell separated by a comma" rather than "separated by cell a comma".
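To make the sepBy versus sepBy1 difference concrete, here is a minimal sketch. The word parser and the two list parsers are hypothetical names made up for illustration, and the Parser type is assumed to come from Text.Parsec.String:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical element parser for illustration: one or more letters.
word :: Parser String
word = many1 letter

-- Zero or more words separated by commas; succeeds on empty input.
wordsMaybeEmpty :: Parser [String]
wordsMaybeEmpty = word `sepBy` char ','

-- One or more words separated by commas; fails on empty input.
wordsNonEmpty :: Parser [String]
wordsNonEmpty = word `sepBy1` char ','
```

On the empty string, wordsMaybeEmpty succeeds with an empty list, while wordsNonEmpty fails because there is no first word to parse.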
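EDIT: If you don't want to accept empty cells, you have to specify that ncell has at least one character using many1. With many, ncell succeeds on empty input without consuming anything, which is why the text after the final newline is parsed as a row containing a single empty String: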
ncell = many1 (escChar <|> noneOf ",\n")
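Putting both changes together, the whole parser might look roughly like this. It is only a sketch: the type signatures and the Parser type from Text.Parsec.String are my assumptions, since the question does not show its imports.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Rows separated, and optionally ended, by a newline.
csvParser :: Parser [[String]]
csvParser = row `sepEndBy` char '\n'

-- A row is at least one cell, with cells separated by commas.
row :: Parser [String]
row = cell `sepBy1` char ','

-- A cell is either a quoted cell or a normal cell.
cell :: Parser String
cell = qcell <|> ncell

-- many1: a normal cell must contain at least one character,
-- so it can no longer succeed on empty input.
ncell :: Parser String
ncell = many1 (escChar <|> noneOf ",\n")

-- A quoted cell may still be empty, i.e. "".
qcell :: Parser String
qcell = do
    _ <- char '"'
    res <- many (escChar <|> noneOf "\"")
    _ <- char '"'
    return res

-- Any character preceded by a backslash; the backslash is dropped.
escChar :: Parser Char
escChar = char '\\' >> anyChar
```

With this version, parsing "a,b\nc,d\n" should give [["a","b"],["c","d"]] with no trailing empty row. The trade-off is that a line such as a,,b is now rejected, because an unquoted cell needs at least one character; an empty cell would have to be written as "".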