
Use Parsec to parse several different kinds of fields


I have a small Parsec parser that parses tab-separated values (TSV) into strings. I want to extend it so that it also recognizes numbers and boolean values (written as "Y" or "N") in the source file.

Here's the old TSV version (returns [[String]])

tsvFile = endBy line newline
line = sepBy cell tab
cell = many (noneOf "\t\n")

I would like to change it to support these types:

data Cell = CellString String
          | CellNumber Int
          | CellBool Bool          
          deriving (Show)

Here are the functions I've defined for number and bool. Are these incorrect?

cellBool = do
    b <- oneOf "YN"
    return $ CellBool (b == 'Y')

cellNumber = do
    d <- many digit
    return $ CellNumber (read d)

cellString = do
    s <- many (noneOf "\t\n")
    return $ CellString s

And here's what I thought I needed to do to get it to work:

cell = cellBool <|> cellNumber <|> cellString

But it doesn't work. Running cellNumber before cellString returns Right []. If I put cellString first in the list, it parses the whole file as strings.

I'm sure I'm missing something basic. Like, only the cellString method is dealing with the tab separator I think, but I'm really new to parsec and confused. I appreciate your help!


Solution

  • I was able to get it working by simply changing the definition of cellNumber:

    cellNumber = do
        d <- many1 digit
        return $ CellNumber (read d)
    

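    The problem was that cellNumber used many, which succeeds on zero digits without consuming any input. Because cellNumber always succeeded, <|> never fell through to cellString. With many1, cellNumber fails on a cell that has no leading digit, allowing cellString to run.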

    However, at this point your parser would still fail on an input like "123a\n": cellNumber consumes "123" and succeeds, and then the parser expects a tab or newline but finds 'a'. You'll need some backtracking to get that case working.


    Using the definition

    cellNumber = do
        d <- many1 digit
        lookAhead $ oneOf "\t\n"
        return $ CellNumber (read d)
    

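    probably isn't ideal, since it also fails when the number is the last thing in the input. Instead, I would consider something like

    cellNumber = do
        d <- many1 digit
        -- succeed only if the cell ends here: the next character is a tab,
        -- a newline, or the end of input
        notFollowedBy (noneOf "\t\n")
        return $ CellNumber (read d)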
    

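    Then change your cell definition to wrap the alternatives in try, so that a cellNumber that fails after consuming digits gives them back to cellString:

    cell = try cellBool <|> try cellNumber <|> cellString

    Note that cellBool has the same problem as the original cellNumber: a cell like "Yes" parses as CellBool True and then fails on the "e". Adding the same notFollowedBy check after oneOf "YN" fixes that.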
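
Putting the pieces together, here is a minimal self-contained sketch of the whole parser. It is one way to assemble the answer above, not the only one. Several things in it are assumptions added for illustration and are not in the original code: the helper endOfCell, the wrapper parseTSV, the Eq instance on Cell, the end-of-cell check inside cellBool, and the trailing eof (so that leftover input is reported as an error instead of being silently ignored).

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Cell = CellString String
          | CellNumber Int
          | CellBool Bool
          deriving (Show, Eq)  -- Eq is an addition, handy for comparing results

-- Hypothetical helper: succeeds without consuming input only when the cell
-- ends here, i.e. the next character is a tab, a newline, or end of input.
endOfCell :: Parser ()
endOfCell = notFollowedBy (noneOf "\t\n")

-- A lone "Y" or "N". The endOfCell check is an assumption of this sketch;
-- it makes a cell such as "Yes" fall through to cellString.
cellBool :: Parser Cell
cellBool = do
    b <- oneOf "YN"
    endOfCell
    return $ CellBool (b == 'Y')

-- One or more digits that fill the whole cell.
cellNumber :: Parser Cell
cellNumber = do
    d <- many1 digit
    endOfCell
    return $ CellNumber (read d)

-- Anything else, including the empty cell.
cellString :: Parser Cell
cellString = CellString <$> many (noneOf "\t\n")

-- try lets cellBool and cellNumber give back consumed input when
-- endOfCell rejects the cell, so the next alternative starts fresh.
cell :: Parser Cell
cell = try cellBool <|> try cellNumber <|> cellString

line :: Parser [Cell]
line = sepBy cell tab

-- Every line must end in a newline; eof rejects trailing garbage.
tsvFile :: Parser [[Cell]]
tsvFile = endBy line newline <* eof

-- Hypothetical convenience wrapper around parse.
parseTSV :: String -> Either ParseError [[Cell]]
parseTSV = parse tsvFile "(tsv)"
```

With this sketch, parseTSV "1\tY\t123a\n" should give Right [[CellNumber 1,CellBool True,CellString "123a"]]: the third cell starts with digits, but endOfCell rejects it as a number, try rewinds, and cellString picks up the whole cell.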