
Parsec csv parser parsing extra line


I have defined the following Parsec parser for parsing CSV files into a table of strings, i.e. [[String]]:

--A csv parser is some rows separated, and possibly ended, by a newline character
csvParser = sepEndBy row (char '\n')
--A row is some cells separated by a comma character
row = sepBy cell (char ',')
--A cell is either a quoted cell, or a normal cell
cell = qcell <|> ncell
--A normal cell is a series of characters, each either an escaped character or anything other than , or newline
ncell = many (escChar <|> noneOf ",\n")
--A quoted cell is a " followed by some characters, each either an escaped character or anything other than ", and then a closing "
qcell = do
    char '"'
    res <- many (escChar <|> noneOf "\"")
    char '"'
    return res
--An escape character is a \ followed by any character. The \ will be discarded.
escChar = char '\\' >> anyChar

I don't really know if the comments are too much and annoying, or if they are helping. As a Parsec noob they would help me, so I thought I would add them.

It works pretty well, but there is a problem: it creates an extra, empty row in the table. So if I, for example, have a CSV file with 10 rows (that is, only 10 lines, with no empty lines at the end*), the [[String]] structure will have length 11, and the last list of Strings will contain one element: an empty String (at least this is how it appears when printing it using show).

My main question is: Why does this extra row appear, and what can I do to stop it?

Another thing I have noticed is that if there are empty lines after the data in the CSV file, these will end up in the table as rows containing only an empty String. I thought that using sepEndBy instead of sepBy would make the extra empty lines be ignored. Is this not the case?

*After looking at the text file in a hex editor, it seems that it does in fact end in a newline character, even though vim doesn't show it...


Solution

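  • If you want each row to have at least one cell, you can use sepBy1 instead of sepBy. The difference between sepBy and sepBy1 is the same as the difference between many and many1: the 1 version only parses sequences of at least one element. On its own this does not remove the extra row, because ncell is built with many and therefore still succeeds on an empty cell; the empty rows stop being parsed once ncell also requires at least one character (see the edit below). So row becomes this: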

    row = sepBy1 cell (char ',')
    

    Also, the usual style is to write sepBy1 infix: cell `sepBy1` char ','. This reads more naturally: you have a "cell separated by a comma" rather than "separated by cell a comma".

    EDIT: If you don't want to accept empty cells, you have to specify that ncell has at least one character using many1:

    ncell = many1 (escChar <|> noneOf ",\n")
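
To see the effect of both changes side by side, here is a minimal sketch that puts the question's parser next to the revised one. The names ncellOrig, ncellFixed, csvOrig, csvFixed, parseOrig and parseFixed are made up for this comparison, and the type signatures and the use of between are additions that are not in the original code; it assumes the parsec package and plain String input.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- An escaped character: a backslash followed by any character.
-- The backslash itself is dropped.
escChar :: Parser Char
escChar = char '\\' >> anyChar

-- A quoted cell: everything between two double quotes (may be empty).
qcell :: Parser String
qcell = between (char '"') (char '"') (many (escChar <|> noneOf "\""))

-- The question's unquoted cell: 'many' also succeeds on the empty string.
ncellOrig :: Parser String
ncellOrig = many (escChar <|> noneOf ",\n")

-- The revised unquoted cell: 'many1' requires at least one character.
ncellFixed :: Parser String
ncellFixed = many1 (escChar <|> noneOf ",\n")

-- The question's parser, unchanged apart from being written inline.
csvOrig :: Parser [[String]]
csvOrig = sepEndBy (sepBy (qcell <|> ncellOrig) (char ',')) (char '\n')

-- The revised parser: sepBy1 for rows, many1 for unquoted cells.
csvFixed :: Parser [[String]]
csvFixed = ((qcell <|> ncellFixed) `sepBy1` char ',') `sepEndBy` char '\n'

-- Hypothetical helpers for trying the two parsers on a String.
parseOrig, parseFixed :: String -> Either ParseError [[String]]
parseOrig  = parse csvOrig  "(input)"
parseFixed = parse csvFixed "(input)"
```

In GHCi, parseOrig "a,b\nc,d\n" gives Right [["a","b"],["c","d"],[""]]: after the final newline, sepEndBy tries row once more, and row succeeds on the empty remainder because ncellOrig matches zero characters. parseFixed "a,b\nc,d\n" gives Right [["a","b"],["c","d"]], since row now fails without consuming input and sepEndBy stops there. Note that neither helper demands end of input, so with the revised parser any trailing blank lines are simply left unconsumed rather than turned into rows.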