Search code examples
parsinghaskellparsec

Trouble parsing letters adjacent to operators with Parsec


I'm trying to parse a simplified expression language with Parsec in Haskell to solve the Tiny Three-Pass Compiler kata on CodeWars. I'm running into issues where my parser won't parse correctly if there's no whitespace between an identifier and an operator; a * a parses to the complete expression, but a*a only yields the first a.

Stack script to demonstrate the problem:

#!/usr/bin/env stack
-- stack --resolver lts-10.7 script

import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok

langDef :: Tok.LanguageDef ()
langDef = Tok.LanguageDef
  { Tok.commentStart    = ""
  , Tok.commentEnd      = ""
  , Tok.commentLine     = ""
  , Tok.nestedComments  = False
  , Tok.identStart      = letter
  , Tok.identLetter     = letter
  , Tok.opStart         = oneOf "+-*/"
  , Tok.opLetter        = oneOf "+-*/"
  , Tok.reservedNames   = []
  , Tok.reservedOpNames = []
  , Tok.caseSensitive   = True
  }

lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser langDef

identifier :: Parser String
identifier = Tok.identifier lexer

reserved :: String -> Parser ()
reserved = Tok.reserved lexer

data AST = Var String
         | Add AST AST
         | Sub AST AST
         | Mul AST AST
         | Div AST AST
         deriving (Eq, Show)

expression :: Parser AST
expression = term `chainl1` addSubOp

addSubOp :: Parser (AST -> AST -> AST)
addSubOp =  (reserved "+" >> return Add)
        <|> (reserved "-" >> return Sub)

term :: Parser AST
term = factor `chainl1` multDivOp

multDivOp :: Parser (AST -> AST -> AST)
multDivOp =  (reserved "*" >> return Mul)
         <|> (reserved "/" >> return Div)

factor :: Parser AST
factor = variable

variable :: Parser AST
variable = do
  varName <- identifier
  return $ Var varName

main = do
  putStrLn $ show $ parse expression "" "a + a"
  putStrLn $ show $ parse expression "" "a+a"
  putStrLn $ show $ parse expression "" "a - a"
  putStrLn $ show $ parse expression "" "a-a"
  putStrLn $ show $ parse expression "" "a * a"
  putStrLn $ show $ parse expression "" "a*a"
  putStrLn $ show $ parse expression "" "a / a"
  putStrLn $ show $ parse expression "" "a/a"

Running this outputs:

$ ./AdjacentParseIssue.hs 
Right (Add (Var "a") (Var "a"))
Right (Var "a")
Right (Sub (Var "a") (Var "a"))
Right (Var "a")
Right (Mul (Var "a") (Var "a"))
Right (Var "a")
Right (Div (Var "a") (Var "a"))
Right (Var "a")

How can I write my parser so that both a * a and a*a parse to the same result?


Solution

  • Tok.reserved is for identifiers. You should be using Tok.reservedOp when parsing operators. Consider changing the calls to reserved to call a similar function:

    reservedOp :: String -> Parser ()
    reservedOp = Tok.reservedOp lexer
    

    Edit

    To clarify what is happening under the hood, here is the implementation of Tok.reserved:

    reserved name =
        lexeme $ try $
        do{ _ <- caseString name
          ; notFollowedBy (identLetter languageDef) <?> ("end of " ++ show name)
          }
    

    Notice that reserved blindly accepts name without validating whether it is a valid operator or identifier, but that it stops if there are further valid identifier characters (otherwise reserved "foo" would yield incorrect results on a value of fooBar).

    Since you specified that identifiers are any valid letters, Tok.reserved would halt when further letters were found, which is why "*a" failed.

    Tok.reservedOp includes a similar restriction which short-circuits parsing when adjacent operator characters (from opLetter) are present. (e.g. otherwise you could mistake ** (a common exponent expression) for *)