I am working with the following data type:
data SuffixTree = Leaf Int | Node [(String, SuffixTree)]
deriving (Eq, Show)
Each subtree has a corresponding label (string).
The idea is to build the corresponding suffix tree by adding each suffix and its index into an accumulating tree (at the beginning it is Node []
).
This is already defined
buildTree s
= foldl (flip insert) (Node []) (zip (suffixes s) [0..length s-1])
where suffixes
is correctly defined.
I've been trying to implement the insert
function for a while but can't seem to succeed.
This is what I have now (the names and style are not the best since this is still work in progress):
insert :: (String, Int) -> SuffixTree -> SuffixTree
insert pair tree@(Node content)
= insert' pair tree content
where
insert' :: (String, Int) -> SuffixTree -> [(String, SuffixTree)] -> SuffixTree
insert' (s, n) (Node []) subtrees
= Node ((s, Leaf n) : subtrees)
insert' (s, n) (Node content@((a, tree) : pairs)) subtrees
| null p = insert' (s, n) (Node pairs) subtrees
| p == a = insert' (r, n) tree subtrees
| p /= a = Node ((p, newNode) : (subtrees \\ [(a, tree)]))
where
(p, r, r') = partition s a
newNode = Node [(r, (Leaf n)), (r', tree)]
The partition
function takes two strings and returns a tuple consisting of:
I think I understand the rules needed to build the tree.
We start by comparing the label of the first subtree to the string we want to insert (say, str
). If they don't have a prefix in common, we try to insert in the next subtree.
If the label is a prefix of str
, we continue to look into that subtree, but instead of using str
we try to insert str
without the prefix.
If str
is a prefix of label, then we replace the existing subtree with a new Node
, having a Leaf
and the old subtree. We also adjust the labels.
If we don't have a match between str
and any label then we add a new Leaf
to the list of subtrees.
However, the biggest problem that I have is that I need to return a new tree containing the changes, so I have to keep track of everything else in the tree (not sure how to do this or if I'm thinking correctly about this).
The code appears to be working correctly on this string: "banana"
:
Node [("a",Node [("",Leaf 5),("na",Node [("",Leaf 3),("na",Leaf 1)])]),
("na",Node [("",Leaf 4),("na",Leaf 2)]),("banana",Leaf 0)]
However, on this string "mississippi"
I get an Exception: Non-exhaustive patterns in function insert'
.
Any help or ideas are greatly appreciated!
You are using a quadratic algorithm; whereas optimally, suffix tree can be constructed in linear time. That said, sticking with the same algorithm, a possibly better approach would be to first build the (uncompressed) suffix trie (not tree) and then compress the resulting trie.
The advantage would be that a suffix trie can be represented using Data.Map
:
data SuffixTrie
= Leaf' Int
| Node' (Map (Maybe Char) SuffixTrie)
which makes manipulations both more efficient and easier than list of pairs. Doing so, you may also completely bypass common prefix calculations, as it comes out by itself:
import Data.List (tails)
import Data.Maybe (maybeToList)
import Control.Arrow (first, second)
import Data.Map.Strict (Map, empty, insert, insertWith, assocs)
data SuffixTree
= Leaf Int
| Node [(String, SuffixTree)]
deriving Show
data SuffixTrie
= Leaf' Int
| Node' (Map (Maybe Char) SuffixTrie)
buildTrie :: String -> SuffixTrie
buildTrie s = foldl go (flip const) (init $ tails s) (length s) $ Node' empty
where
go run xs i (Node' ns) = run (i - 1) $ Node' tr
where tr = foldr loop (insert Nothing $ Leaf' (i - 1)) xs ns
loop x run = insertWith (+:) (Just x) . Node' $ run empty
where _ +: Node' ns = Node' $ run ns
buildTree :: String -> SuffixTree
buildTree = loop . buildTrie
where
loop (Leaf' i) = Leaf i
loop (Node' m) = Node $ con . second loop <$> assocs m
con (Just x, Node [(xs, tr)]) = (x:xs, tr) -- compress single-child nodes
con n = maybeToList `first` n
then:
\> buildTree "banana"
Node [("a",Node [("",Leaf 5),
("na",Node [("",Leaf 3),
("na",Leaf 1)])]),
("banana",Leaf 0),
("na",Node [("",Leaf 4),
("na",Leaf 2)])]
similarly:
\> buildTree "mississippi"
Node [("i",Node [("",Leaf 10),
("ppi",Leaf 7),
("ssi",Node [("ppi",Leaf 4),
("ssippi",Leaf 1)])]),
("mississippi",Leaf 0),
("p",Node [("i",Leaf 9),
("pi",Leaf 8)]),
("s",Node [("i",Node [("ppi",Leaf 6),
("ssippi",Leaf 3)]),
("si",Node [("ppi",Leaf 5),
("ssippi",Leaf 2)])])]