I'm trying to parse a simplified expression language with Parsec in Haskell to solve the Tiny Three-Pass Compiler kata on CodeWars. I'm running into issues where my parser won't parse correctly if there's no whitespace between an identifier and an operator; a * a
parses to the complete expression, but a*a
only yields the first a
.
Stack script to demonstrate the problem:
#!/usr/bin/env stack
-- stack --resolver lts-10.7 script
import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok
langDef :: Tok.LanguageDef ()
langDef = Tok.LanguageDef
{ Tok.commentStart = ""
, Tok.commentEnd = ""
, Tok.commentLine = ""
, Tok.nestedComments = False
, Tok.identStart = letter
, Tok.identLetter = letter
, Tok.opStart = oneOf "+-*/"
, Tok.opLetter = oneOf "+-*/"
, Tok.reservedNames = []
, Tok.reservedOpNames = []
, Tok.caseSensitive = True
}
lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser langDef
identifier :: Parser String
identifier = Tok.identifier lexer
reserved :: String -> Parser ()
reserved = Tok.reserved lexer
data AST = Var String
| Add AST AST
| Sub AST AST
| Mul AST AST
| Div AST AST
deriving (Eq, Show)
expression :: Parser AST
expression = term `chainl1` addSubOp
addSubOp :: Parser (AST -> AST -> AST)
addSubOp = (reserved "+" >> return Add)
<|> (reserved "-" >> return Sub)
term :: Parser AST
term = factor `chainl1` multDivOp
multDivOp :: Parser (AST -> AST -> AST)
multDivOp = (reserved "*" >> return Mul)
<|> (reserved "/" >> return Div)
factor :: Parser AST
factor = variable
variable :: Parser AST
variable = do
varName <- identifier
return $ Var varName
main = do
putStrLn $ show $ parse expression "" "a + a"
putStrLn $ show $ parse expression "" "a+a"
putStrLn $ show $ parse expression "" "a - a"
putStrLn $ show $ parse expression "" "a-a"
putStrLn $ show $ parse expression "" "a * a"
putStrLn $ show $ parse expression "" "a*a"
putStrLn $ show $ parse expression "" "a / a"
putStrLn $ show $ parse expression "" "a/a"
Running this outputs:
$ ./AdjacentParseIssue.hs
Right (Add (Var "a") (Var "a"))
Right (Var "a")
Right (Sub (Var "a") (Var "a"))
Right (Var "a")
Right (Mul (Var "a") (Var "a"))
Right (Var "a")
Right (Div (Var "a") (Var "a"))
Right (Var "a")
How can I write my parser so that both a * a
and a*a
parse to the same result?
Tok.reserved
is for identifiers. You should be using Tok.reservedOp
when parsing operators. Consider changing the calls to reserved
to call a similar function:
reservedOp :: String -> Parser ()
reservedOp = Tok.reservedOp lexer
Edit
To clarify what is happening under the hood, here is the implementation of Tok.reserved
:
reserved name =
lexeme $ try $
do{ _ <- caseString name
; notFollowedBy (identLetter languageDef) <?> ("end of " ++ show name)
}
Notice that reserved
blindly accepts name
without validating whether it is a valid operator or identifier, but that it stops if there are further valid identifier characters (otherwise reserved "foo"
would yield incorrect results on a value of fooBar
).
Since you specified that identifiers are any valid letters, Tok.reserved
would halt when further letters were found, which is why "*a"
failed.
Tok.reservedOp
includes a similar restriction which short-circuits parsing when adjacent operator characters (from opLetter
) are present. (e.g. otherwise you could mistake **
(a common exponent expression) for *
)