Search code examples
whitespaceidrisparser-combinatorslightyear

Lightyear parser behaves in unexpected way


I am trying to build a formatter for Idris with Lightyear. The whole program so far is here:

https://github.com/hejfelix/IdrisFMT/blob/501a4a9e8b1b4154ed0d7836676c24d98de8b76a/IdrisFmt.idr

For now, the purpose is to tokenize the file itself and then pretty print it, i.e., the file as input should be a fixpoint.

The problem comes after each string literal, where my parser seems to eat up whitespace. If I put anything else than whitespace immediately after a string literal, it will parse both that character as well as all the following whitespace.

This sample program will show the error:

main2 : IO ()
main2 = putStrLn $ str
    where
      str = case parse tokenParser "\"IdrisFMT.idr\" \n" of
                 (Left l) => "failed" ++ show l
                 (Right r) => show $ map (show @{default}) r

This prints out:

*IdrisFMT> :exec main2
["StringLiteral(\"IdrisFMT.idr\")"]

If I change the string I'm parsing to "\"IdrisFMT.idr\"c \n", I get:

*IdrisFMT> :exec main2
["StringLiteral(\"IdrisFMT.idr\")", "Identifier(c)", "' '", "'\\n'"]

which is what I expected.

I believe the error arises from the way I parse the string literals, but I am failing to understand my mistake, and I'm having trouble finding a good way to debug lightyear parsers. The implementation of my string literal parser is as follows:

escape : Parser String
escape = do
  d <- char '\\'
  c <- oneOf "\\\"0nrvtbf"
  pure $ pack $ (the $ List Char) [d,c]

nonEscape : Parser String
nonEscape = map (\x => pack $ (the $ List _) [x]) $ noneOf "\\\"\0\n\r\v\t\b\f"

character : Parser String
character = nonEscape <|>| escape

stringLiteralToken : Parser Token
stringLiteralToken = map (StringLiteral . concat) $ dquote (many character)

How can I prevent my string literal parser from eating up whitespace after the literal?


Solution

  • After chatting on the #idris channel, I was helped to understand that most of the built-in higher order parsers (e.g.dquote) skip whitespace at the end. In my case, this was not what I wanted. Instead, I used the between function which takes 3 parameters, a parser for when to start, another for when to stop, and a third for whatever comes in between.

    To parse string literals, I am now doing this:

    escape : Parser String
    escape = do
      d <- char '\\'
      c <- oneOf "\\\"0nrvtbf'"
      pure $ pack $ (the $ List Char) [d,c]
    
    nonEscape : Parser String
    nonEscape = map (\x => pack $ (the $ List _) [x]) $ noneOf "\\\"\0\n\r\v\t\b\f"
    
    character : Parser String
    character = nonEscape <|>| escape
    
    stringLiteralToken : Parser Token
    stringLiteralToken = map (StringLiteral . concat) $ (between (char '"') (char '"')) (many character)
    

    This solved my problem.