Search code examples
haskellparsecparser-combinators

parsec: parse nested code blocks


I want to parse following text:

keyword some more values
        funcKeyw funcName1
        funcKeyw funcName2

        funcKeyw funcName3

        keyword some more values
                 funcKeyw funcName2

        keyword some more values
                 funcKeyw funcName4

Indentation is done by tabs. Each block is started by keyword and some additional values in the same line. Everything indented belongs to the same block. After all the function calls (that start with the funcKeyw keyword) there can be child keyword blocks (separated by an "empty" line; "empty" means either nothing in it or whitespace characters).

type IndentLevel = Int

data Block = Block { blockFuncCalls :: [String]
                   , blockBlocks    :: [Block]
                   }

block :: GenParser Char st Block
block = parseBlock 0
    where
        parseBlock lvl = do
            count lvl tab
            string "keyword"
            -- [...] Parse other stuff in that line.
            newline

            -- Parse 'function calls'.
            fs <- sepBy1 (blockFunc (lvl + 1)) emptyLines
            -- Parse optional child blocks.
            emptyLines
            bs <- sepBy (parseBlock (lvl + 1)) emptyLines

            return Block { blockFuncCalls=fs
                         , blockBlocks=bs
                         }

blockFunc :: IndentLevel -> GenParser Char st String
blockFunc lvl = do
    count lvl tab
    string "funcKeyw"
    -- [...] Parse function name etc..
    newline
    return funcName -- Parsed func name.

emptyLine :: GenParser Char st ()
emptyLine = many (oneOf "\t ") >> newline >> return ()

emptyLines :: GenParser Char st ()
emptyLines = many emptyLine >> return ()

The problem is that the blockFunc parser does not stop parsing when the child block starts, but returns an error unexpected 'keyword'.

How can I avoid that? I think I could use try or choice to choose the correct parser for each line, but I want to require the function calls to be before the child blocks.


Solution

  • One thing I have noticed is that the sepBy combinators have somewhat unexpected behaviour, namely, that if the separator is began to be parsed, and that fails, the entire sepBy fails, instead of simple returning what was parsed so far. You can use the following variants, which differ by an additional try inside of sepBy1Try:

    sepBy1Try :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m sep -> ParsecT s u m [a]
    sepBy1Try p sep = do
      x <- p
      xs <- many (try $ sep *> p)
      return (x:xs)
    
    sepByTry p sep = sepBy1Try p sep <|> return []
    

    Use these in place of sepBy:

    block :: GenParser Char st Block
    block = parseBlock 0
        where
            parseBlock lvl = do
                count lvl tab
                string "keyword"
    
                otherStuff <- many (noneOf "\r\n") 
                newline
    
                -- Parse 'function calls'.
                fs <- sepBy1Try (blockFunc (lvl + 1)) emptyLines
    
                -- Parse optional child blocks.
                emptyLines
                bs <- sepByTry (try $ parseBlock (lvl + 1)) emptyLines
    
                return Block { blockFuncCalls=fs
                             , blockBlocks=bs
                             , blockValues=words otherStuff
                             }
    

    I also modified your datatype to also capture some more information (just for demonstration purposes). Also, note yet another try in front of the recursive parseBlock - this is because this parses must fail without consuming input, when it sees for example one tab, but is expecting two, this try allows it to backtrack to the "next level".

    Finally, change the following:

    emptyLines :: GenParser Char st ()
    emptyLines = many (try emptyLine) >> return ()
    

    Same reasoning as sepBy here...


    A test with a simple pretty printer for clarity:

    data Block = Block { blockValues :: [String]
                       , blockFuncCalls :: [String]
                       , blockBlocks    :: [Block]
                       } deriving (Show, Eq) 
    
    pprBlock :: Block -> String 
    pprBlock = unlines . go id where 
      go ii (Block vals funcs subblocks) = 
        let ii' = ii . ('\t':) in 
        (ii $ unwords $ "keyword":vals) : 
        map (\f -> ii' $ "function " ++ f) funcs ++ 
        concatMap (go ii') subblocks
    
    test0_run = either (error.show) (putStrLn.pprBlock) $ parse block "" $ test0
    
    test0 = unlines $ 
      [ "keyword some more values"
      , "\tfuncKeyw funcName1"
      , "\tfuncKeyw funcName2"
      , "\t"
      , "\tfuncKeyw funcName3"
      , "\t"
      , "\tkeyword some more values"
      , "\t\tfuncKeyw funcName2"
      , ""
      , "\tkeyword some more values"
      , "\t\tfuncKeyw funcName4"
      ]
    

    and

    >test0_run
    keyword some more values
            function funcName1
            function funcName2
            function funcName3
            keyword some more values
                    function funcName2
            keyword some more values
                    function funcName4
    
    >putStrLn test0
    keyword some more values
            funcKeyw funcName1
            funcKeyw funcName2
    
            funcKeyw funcName3
    
            keyword some more values
                    funcKeyw funcName2
    
            keyword some more values
                    funcKeyw funcName4
    
    >