Search code examples
regexhaskellgraph-theoryabstract-syntax-tree

Simplify parsed regex


I have to simplify custom regex expressions parsed to a certain data type. With "simplify" I mean the following (emphasis mine):

Given the rules:

  • lowercase letters match themselves, eg.: a matches a and nothing else
  • parens enclosing only letters match their full sequence, eg.: (abc) matches abc and nothing else
  • square brackets enclosing only letters match every letters inside, eg.: [abc] matches a and b and c and nothing else

The following are all valid:

  • (a[bc]) matches ab and ac and nothing else
  • [a(bc)] matches a and bc and nothing else
  • (a(bc)) is the same as (abc) and matches abc and nothing else
  • [a[bc]] is the same as [abc] and matches a and b and c and nothing else

Regexes can be simplified. For example [a[[bb]b[[b]]](c)(d)] is really just the same as [abcd] which matches a, b, c and d.

I have implemented a simple parser combinator in Haskell using attoparsec and the following destination data type:

data Regex
  = Symbol Char
  | Concat [Regex] -- ()
  | Union [Regex] -- []
  deriving (Eq)

However, I'm really struggling with the simplification part. I try to reduce the Concats and Unions by a combination of unwrapping them, nubbing and concatMapping to no avail. I think that the data type I have defined might not be the best fit but I have run out of ideas (late at night here). Could you help me look to the right direction? Thanks!

simplify :: Regex -> Regex
simplify (Symbol s) = Symbol s
simplify (Concat [Symbol c]) = Symbol c
simplify (Concat rs) = Concat $ simplify <$> rs
simplify (Union [Symbol c]) = Symbol c
simplify (Union rs) = Union $ nub $ simplify <$> rs

Solution

  • You are missing a couple simple improvements, for starters. simplify (Concat [x]) = x and likewise for Union: there's no need for the wrapped regex to be specifically a symbol.

    Then you need to start looking at Concats containing other Concats, and likewise for Union. Sure, you start by simplifying the elements of the wrapped list, but before jamming the result back into a wrapper, you lift up any elements using the same wrapper. Something like:

    simplify (Concat xs) = 
      case concatMap liftConcats (map simplify xs) of
        [x] -> x
        xs -> Concat xs
      where liftConcats :: Regex -> [Regex]
            liftConcats r = _exerciseForTheReader
    

    Then you can do something similar for Union, with a nub thrown in as well.