Search code examples
haskellparsec

Parsec lookahead to handle ints


I'm working on a Parsec parser to handle a somewhat complex data file format (and I have no control over this format).

I've made a lot of progress, but am currently stuck with the following.

I need to be able to parse a line somewhat like this:

4  0.123  1.452  0.667  *  3.460  149 - -

Semantically, the 4 is a nodeNum, the Floats and the * are negative log probabilities (so, * represents the negative log of probability zero). The 149 and the minus signs are really junk, which I can discard, but I need to at least make sure they don't break the parser.

Here's what I have so far:

This handles the "junk" I mentioned. It could probably be simpler, but it works by itself.

 emAnnotationSet = (,,) <$> p_int  <*>
                           (reqSpaces *> char '-') <*>
                           (reqSpaces *> char '-')

the nodeNum at the beginning of the line is handled by another parser that works and I need not get into.

The problem is in trying to pick out all the p_logProbs from the line, without consuming the digits at the beginning of the emAnnotationSet.

the parser for p_logProb looks like this:

p_logProb = liftA mkScore (lp <?> "logProb")
          where lp = try dub <|> string "*"
                dub = (++) <$> ((++) <$> many1 digit <*> string ".") <*> many1 digit

And finally, I try to separate the logProb entries from the trailing emAnnotationSet (which starts with an integer) as follows:

hmmMatchEmissions     = optSpaces *> (V.fromList <$> sepBy p_logProb reqSpaces) 
                      <* optSpaces <* emAnnotationSet <* eol 
                      <?> "matchEmissions"

So, p_logProb will only succeed on a float that begins with digits, includes a decimal point, and then has further digits (this restriction is respected by the file format).

I'd hoped that the try in the p_logProb definition would avoid consuming the leading digits if it didn't parse the decimal and the rest, but this doesn't seem to work; Parsec still complains that it sees an unexpected space after the digits of that integer in the emAnnotationSet:

Left "hmmNode" (line 1, column 196):
unexpected " "
expecting logProb

column 196 corresponds to the space after the integer preceding the minus signs, so it's clear to me that the problem is that the p_logProb parser is consuming the integer. How can I fix this so the p_logProb parser uses lookahead correctly, thus leaving that input for the emAnnotationSet parser?


Solution

  • The integer which terminates the probabilities cannot be mistaken for a probability since it doesn't contain a decimal point. The lexeme combinator converts a parser into one that skips trailing spaces.

    import Text.Parsec
    import Text.Parsec.String
    import Data.Char
    import Control.Applicative ( (<$>), (<*>), (<$), (<*), (*>) )
    
    fractional :: Fractional a => Parser a
    fractional = try $ do
      n <- fromIntegral <$> decimal
      char '.'
      f <- foldr (\d f -> (f + fromIntegral (digitToInt d))/10.0) 0.0 <$> many1 digit  
      return $ n + f
    
    decimal :: Parser Int
    decimal = foldl (\n d -> 10 * n + digitToInt d) 0 <$> many1 digit
    
    lexeme :: Parser a -> Parser a
    lexeme p = p <* skipMany (char ' ')
    
    data Row = Row Int [Maybe Double]
               deriving ( Show )
    
    probability :: Fractional a => Parser (Maybe a)
    probability = (Just <$> fractional) <|> (Nothing <$ char '*')
    
    junk = lexeme decimal <* count 2 (lexeme $ char '-')
    
    row :: Parser Row
    row = Row <$> lexeme decimal <*> many1 (lexeme probability) <* junk
    
    rows :: Parser [Row]
    rows = spaces *> sepEndBy row (lexeme newline) <* eof
    

    Usage:

    *Main> parseTest rows "4 0.123 1.234 2.345 149 - -\n5 0.123 * 2.345 149 - -" 
    [Row 4 [Just 0.123,Just 1.234,Just 2.345],Row 5 [Just 0.123,Nothing,Just 2.345]]