Search code examples
parsinghaskelltrifecta

Haskell/Trifecta: Parsing completely optional semicolons without polluting AST


I have rewritten the question since it was originally posted with a more concise code example:

Consider a language with fully optional semicolons almost entirely as sugar, i.e.:

  • ;; foo; bar;;;; is valid
  • foo bar foobar is valid
  • if (+1); foo is different to if (+1) foo in semantics, so ; cannot be considered whitespace

Here is an example parser:

{-# LANGUAGE OverloadedStrings #-}

import Text.Trifecta
import Text.Trifecta.Delta
import Text.PrettyPrint.ANSI.Leijen (putDoc, (<>), linebreak)
import Control.Monad.Trans.State.Strict
import Control.Applicative

type TestParser a = StateT Int Parser a

data AST a = Foo a | Bar a deriving (Show)

pFoo :: TestParser (AST (Delta, Int))
pFoo = curry Foo <$ string "foo" <*> position <* modify (+1) <*> get

pBar :: TestParser (AST (Delta, Int))
pBar = curry Bar <$ string "bar" <*> position <*> get

pStmt :: TestParser (AST (Delta, Int))
pStmt = semi *> pStmt <|> pFoo <|> pBar <?> "statement"

pTest :: TestParser [AST (Delta, Int)]
pTest = some pStmt

main :: IO ()
main
 = do   let res = parseByteString (evalStateT pTest 0)
                    (Directed "(test)" 0 0 0 0) ";;foo;bar;\nfoo;; foobarbar;;"
        case res of
            Success ast
             -> print ast
            Failure errdoc
             -> putDoc (errdoc <> linebreak)

The problem I am having with such a parser is that I need to be able to skip over semicolons without committing to parse a pStmt. At the moment the following error occurs:

(test):2:18: error: unexpected
    EOF, expected: statement
foo;; foobarbar;;<EOF>

This is because it expects a statement (in semi *> pStmt), however because stacked semicolons can sugar both the beginning and end of expressions I can't be sure I really want to expect/parse one before I already expect one.

One hack I developed was to have Nop as a constructor in my AST, but I really don't want to do that -- it feels like a hack and with the number of semicolons in some documents it would greatly increase memory usage.

I am looking for solutions/suggestions.


Attempt at EBNF form of the desired grammar:

expr = "foo" | "bar"
expr with sugar = expr | ";"
program = { [white space], expr with sugar, [white space] }

Solution

  • Ok, here it is:

    pStmt = pFoo <|> pBar
    
    pWhiteStmt = do
        many whitespace
        p <- pStmt
        many whitespace
        return p
    
    pTest = do
        many semi
        pS <- sepEndBy pWhiteStm (some semi)
        eof
        return pS
    

    And test it:

    > parse pTest "" ";;foo;bar;\nfoo;; foo;bar;bar;;"
    Right ["foo","bar","foo","foo","bar","bar"]
    
    > parse pTest "" ";;foo;bar;\nfoo;; foobarbar;;"
    Left (line 2, column 10):
    unexpected 'b'
    expecting ";" or end of input
    

    If we wish to have a valid "; foobarbar;", then we need to change pWhiteStmt parser to next:

    pWhiteStmt = do
        many whitespace
        p <- some pStmt
        many whitespace
        return p
    

    And check it:

    > parse pTest "" ";;foo;bar;\nfoo;; foobarbar;;"
    Right [["foo"],["bar"],["foo"],["foo","bar","bar"]]
    

    And finally, if we still wish to have valid "; foo bar baz;" then we also need to change pTest function to next:

    pTest = do
        many semi
        pS <- sepEndBy (some pWhiteStm) (some semi)
        eof
        return pS
    

    And test it

    > parse pTest "" ";;foo;bar;\nfoo;; foo bar bar;;"
    Right [[["foo"]],[["bar"]],[["foo"]],[["foo"],["bar"],["bar"]]]
    

    If we have many parentheses , it is need to replace return p to return (concat p) in the pWhiteStmt