Search code examples
parsinghaskellrecursionparsec

Implementing a recursive parser without try


I am trying to parse an XML dump of Wikipedia to find certain links on each page using the Haskell Parsec library. Links are denoted by double brackets: texttext[[link]]texttext. To simplify the scenario as much as possible, let's say I am looking for the first link not enclosed in double curly braces (which can be nested): {{ {{ [[Wrong Link]] }} [[Wrong Link]] }} [[Right Link]]. I have written a parser to discard links which are enclosed in non-nested double braces:

import Text.Parsec

getLink :: String -> Either ParseError String
getLink = parse linkParser "Links"

linkParser = do
    beforeLink
    link <- many $ noneOf "]"
    string "]]"
    return link

beforeLink = manyTill (many notLink) (try $ string "[[")

notLink = try doubleCurlyBrac <|> (many1 normalText)

normalText = noneOf "[{"
           <|> notFollowedByItself '['
           <|> notFollowedByItself '{'

notFollowedByItself c = try ( do x <- char c
                                 notFollowedBy $ char c
                                 return x)

doubleCurlyBrac = between (string "{{") (string "}}") (many $ noneOf "}")

getLinkTest = fmap getLink testList
    where testList = ["   [[rightLink]]   "                            --Correct link is found
                     , "  {{    [[Wrong_Link]]    }}  [[rightLink]]"   --Correct link is found
                     , "  {{  {{ }} [[Wrong_Link]] }} [[rightLink]]" ] --Wrong link is found 

I have tried making the doubleCurlyBrac parser recursive to also discard links in nested curly braces, without success:

doubleCurlyBrac = between (string "{{") (string "}}") betweenBraces
        where betweenBraces = doubleCurlyBrac <|> (many $ try $ noneOf "}")

This parser stops consuming input after the first }}, rather than the final one, in a nested example. Is there an elegant way to write a recursive parser to (in this case) correctly ignore links in nested double curly braces? Also, can it be done without using try? I have found that since try does not consume input, it often causes the parser to hang on unexpected, ill-formed input.


Solution

  • Here's a more direct version that doesn't use a custom lexer. It does use try though, and I don't see how to avoid it here. The problem is that it seems we need a non-committing look ahead to distinguish double brackets from single brackets; try is for non-committing look ahead.

    The high level approach is that same as in my first answer. I've been careful to make the three node parsers commute -- making the code more robust to change -- by using both try and notFollowedBy:

    {-# LANGUAGE TupleSections #-}
    import Text.Parsec hiding (string)
    import qualified Text.Parsec
    import Control.Applicative ((<$>) , (<*) , (<*>))
    import Control.Monad (forM_)
    import Data.List (find)
    
    import Debug.Trace
    
    ----------------------------------------------------------------------
    -- Token parsers.
    
    llink , rlink , lbrace , rbrace :: Parsec String u String
    [llink , rlink , lbrace , rbrace] = reserved
    reserved = map (try . Text.Parsec.string) ["[[" , "]]" , "{{" , "}}"]
    
    ----------------------------------------------------------------------
    -- Node parsers.
    
    -- Link, braces, or string.
    data Node = L [Node] | B [Node] | S String deriving Show
    
    nodes :: Parsec String u [Node]
    nodes = many node
    
    node :: Parsec String u Node
    node = link <|> braces <|> string
    
    link , braces , string :: Parsec String u Node
    link   = L <$> between llink  rlink  nodes
    braces = B <$> between lbrace rbrace nodes
    string = S <$> many1 (notFollowedBy (choice reserved) >> anyChar)
    
    ----------------------------------------------------------------------
    
    parseNodes :: String -> Either ParseError [Node]
    parseNodes = parse (nodes <* eof) "<no file>"
    
    ----------------------------------------------------------------------
    -- Tests.
    
    getLink :: [Node] -> Maybe Node
    getLink = find isLink where
      isLink (L _) = True
      isLink _     = False
    
    parseLink :: String -> Either ParseError (Maybe Node)
    parseLink = either Left (Right . getLink) . parseNodes
    
    testList = [ "   [[rightLink]]   "
               , "  {{    [[Wrong_Link]]    }}  [[rightLink]]"
               , "  {{  {{ }} [[Wrong_Link]] }} [[rightLink]]"
               , " [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}"
               -- Pathalogical example from comments.
               , "{{ab}cd}}"
               -- A more pathalogical example.
               , "{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf" 
               -- No top level link.
               , "{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}"
               -- Too many '{{'.
               , "{{ {{ {{ [[ asdf ]] }} }}"
               -- Too many '}}'.
               , "{{ {{ [[ asdf ]] }} }} }}"
               -- Too many '[['.
               , "[[ {{ [[{{[[asdf]]}}]]}}"
               ]
    
    main =
      forM_ testList $ \ t -> do
      putStrLn $ "Test: ^" ++ t ++ "$"
      let parses = ( , ) <$> parseNodes t <*> parseLink t
          printParses (n , l) = do
            putStrLn $ "Nodes: " ++ show n
            putStrLn $ "Link: " ++ show l
          printError = putStrLn . show
      either printError printParses parses
      putStrLn ""
    

    The output is the same in the non-error cases:

    Test: ^   [[rightLink]]   $
    Nodes: [S "   ",L [S "rightLink"],S "   "]
    Link: Just (L [S "rightLink"])
    
    Test: ^  {{    [[Wrong_Link]]    }}  [[rightLink]]$
    Nodes: [S "  ",B [S "    ",L [S "Wrong_Link"],S "    "],S "  ",L [S "rightLink"]]
    Link: Just (L [S "rightLink"])
    
    Test: ^  {{  {{ }} [[Wrong_Link]] }} [[rightLink]]$
    Nodes: [S "  ",B [S "  ",B [S " "],S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
    Link: Just (L [S "rightLink"])
    
    Test: ^ [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}$
    Nodes: [S " ",L [B [L [S "someLink"]]],S " ",B [],S " ",B [L [S "asdf"]]]
    Link: Just (L [B [L [S "someLink"]]])
    
    Test: ^{{ab}cd}}$
    Nodes: [B [S "ab}cd"]]
    Link: Nothing
    
    Test: ^{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf$
    Nodes: [S "{ [ { {asf{",L [S "[asdfa"],S "]}aasdff ] ] ] ",B [L [S "asdf"]],S "asdf"]
    Link: Just (L [S "[asdfa"])
    
    Test: ^{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}$
    Nodes: [B [L [S "Wrong_Link"],S "asdf",L [S "WRong_Link"],B []],B [L [L [S "Wrong"]]]]
    Link: Nothing
    

    but the parse error messages are not as informative in the cases of unmatched openings:

    Test: ^{{ {{ {{ [[ asdf ]] }} }}$
    "<no file>" (line 1, column 26):
    unexpected end of input
    expecting "[[", "{{", "]]" or "}}"
    
    Test: ^{{ {{ [[ asdf ]] }} }} }}$
    "<no file>" (line 1, column 26):
    unexpected "}}"
    
    Test: ^[[ {{ [[{{[[asdf]]}}]]}}$
    "<no file>" (line 1, column 25):
    unexpected end of input
    expecting "[[", "{{", "]]" or "}}"
    

    and I couldn't figure out how to fix them.