Search code examples
parsinghaskellparsec

How can I parse non-contiguous elements of a permutation phrase using Parsec's Perm?


Haskell's Parsec supports parsing of permutation phrases via the Perm module. In the documentation for permute you can see how the various permutations "abc", "aab", "bc", "cbaaaaa", etc. can be parsed. While the example shows support for parting many contiguous instances of the same element like "aaaa", it won't parse non-contiguous instances like "aabca", presumably because each parser is included only once in each permutation (the paper seems to imply this in the tree...)

Besides sorting the input so like instances are contiguous, what options do I have for parsing non-contiguous instance?


Solution

  • Depending on what you actually want, you may be able to use many $ oneOf ['a','b','c'].

    If you really need to use the permutation parser, keep in mind that allowing parses of multiple adjacent characters introduces ambiguity. For example, in the string "bacacbbca", it could be parsed as the perms bac, acb bca, or, if you allow repeated characters, bac, acbb, with a leftover non-permutation of ca.

    If you allow repeated letters,

    {-# LANGUAGE FlexibleContexts #-}
    
    import Text.Parsec
    import Text.Parsec.Char
    import Text.Parsec.Perm
    import Data.Text
    import Control.Monad.Identity
    
    perm :: (Stream s Identity Char) => Parsec s u (String,String,String)
    perm = permute $ triple <$$>
      (many1 $ char 'a') <||>
      (many1 $ char 'b') <||>
      (many1 $ char 'c')
      where triple a b c = (a,b,c)
    
    multiPerm :: (Stream s Identity Char) => Parsec s u [(String,String,String)]
    multiPerm = many $ try $ perm
    
    main :: IO ()
    main = parseTest multiPerm $ "bacacbbca" 
    

    main produces [("a","b","c"),("a","bb","c")].

    If not:

    perm :: (Stream s Identity Char) => Parsec s u (Char,Char,Char)
    perm = permute $ triple <$$>
      (char 'a') <||>
      (char 'b') <||>
      (char 'c')
      where triple a b c = (a,b,c)
    

    you get the arguably better: [('a','b','c'),('a','b','c'),('a','b','c')].