I have to simplify custom regex expressions parsed to a certain data type. With "simplify" I mean the following (emphasis mine):
Given the rules:
- lowercase letters match themselves, eg.:
a
matchesa
and nothing else- parens enclosing only letters match their full sequence, eg.:
(abc)
matchesabc
and nothing else- square brackets enclosing only letters match every letters inside, eg.:
[abc]
matchesa
andb
andc
and nothing elseThe following are all valid:
(a[bc])
matchesab
andac
and nothing else[a(bc)]
matchesa
andbc
and nothing else(a(bc))
is the same as(abc)
and matchesabc
and nothing else[a[bc]]
is the same as[abc]
and matchesa
andb
andc
and nothing elseRegexes can be simplified. For example
[a[[bb]b[[b]]](c)(d)]
is really just the same as[abcd]
which matchesa
,b
,c
andd
.
I have implemented a simple parser combinator in Haskell using attoparsec
and the following destination data type:
data Regex
= Symbol Char
| Concat [Regex] -- ()
| Union [Regex] -- []
deriving (Eq)
However, I'm really struggling with the simplification part. I try to reduce the Concat
s and Union
s by a combination of unwrapping them, nubbing and concatMapping to no avail. I think that the data type I have defined might not be the best fit but I have run out of ideas (late at night here). Could you help me look to the right direction? Thanks!
simplify :: Regex -> Regex
simplify (Symbol s) = Symbol s
simplify (Concat [Symbol c]) = Symbol c
simplify (Concat rs) = Concat $ simplify <$> rs
simplify (Union [Symbol c]) = Symbol c
simplify (Union rs) = Union $ nub $ simplify <$> rs
You are missing a couple simple improvements, for starters. simplify (Concat [x]) = x
and likewise for Union: there's no need for the wrapped regex to be specifically a symbol.
Then you need to start looking at Concat
s containing other Concat
s, and likewise for Union
. Sure, you start by simplifying the elements of the wrapped list, but before jamming the result back into a wrapper, you lift up any elements using the same wrapper. Something like:
simplify (Concat xs) =
case concatMap liftConcats (map simplify xs) of
[x] -> x
xs -> Concat xs
where liftConcats :: Regex -> [Regex]
liftConcats r = _exerciseForTheReader
Then you can do something similar for Union, with a nub
thrown in as well.