Search code examples
stringhaskellrecursiontreesuffix-tree

Building a suffix tree by inserting each suffix in Haskell


I am working with the following data type:

data SuffixTree = Leaf Int | Node [(String, SuffixTree)] 
                deriving (Eq, Show)

Each subtree has a corresponding label (string). The idea is to build the corresponding suffix tree by adding each suffix and its index into an accumulating tree (at the beginning it is Node []).

This is already defined

buildTree s
    = foldl (flip insert) (Node []) (zip (suffixes s) [0..length s-1]) 

where suffixes is correctly defined.

I've been trying to implement the insert function for a while but can't seem to succeed.

This is what I have now (the names and style are not the best since this is still work in progress):

insert :: (String, Int) -> SuffixTree -> SuffixTree
insert pair tree@(Node content) 
  = insert' pair tree content
  where
    insert' :: (String, Int) -> SuffixTree -> [(String, SuffixTree)] -> SuffixTree
    insert' (s, n) (Node []) subtrees
      = Node ((s, Leaf n) : subtrees)
    insert' (s, n) (Node content@((a, tree) : pairs)) subtrees
      | null p = insert' (s, n) (Node pairs) subtrees
      | p == a = insert' (r, n) tree subtrees
      | p /= a = Node ((p, newNode) : (subtrees \\ [(a, tree)]))
      where
        (p, r, r')  = partition s a
        newNode     = Node [(r, (Leaf n)), (r', tree)]

The partition function takes two strings and returns a tuple consisting of:

  1. The common prefix (if it exists)
  2. The first string without the prefix
  3. The second string without the prefix

I think I understand the rules needed to build the tree.

We start by comparing the label of the first subtree to the string we want to insert (say, str). If they don't have a prefix in common, we try to insert in the next subtree.

If the label is a prefix of str, we continue to look into that subtree, but instead of using str we try to insert str without the prefix.

If str is a prefix of label, then we replace the existing subtree with a new Node, having a Leaf and the old subtree. We also adjust the labels.

If we don't have a match between str and any label then we add a new Leaf to the list of subtrees.

However, the biggest problem that I have is that I need to return a new tree containing the changes, so I have to keep track of everything else in the tree (not sure how to do this or if I'm thinking correctly about this).

The code appears to be working correctly on this string: "banana":

Node [("a",Node [("",Leaf 5),("na",Node [("",Leaf 3),("na",Leaf 1)])]),
("na",Node [("",Leaf 4),("na",Leaf 2)]),("banana",Leaf 0)]

However, on this string "mississippi" I get an Exception: Non-exhaustive patterns in function insert'.

Any help or ideas are greatly appreciated!


Solution

  • You are using a quadratic algorithm; whereas optimally, suffix tree can be constructed in linear time. That said, sticking with the same algorithm, a possibly better approach would be to first build the (uncompressed) suffix trie (not tree) and then compress the resulting trie.

    The advantage would be that a suffix trie can be represented using Data.Map:

    data SuffixTrie
      = Leaf' Int
      | Node' (Map (Maybe Char) SuffixTrie)
    

    which makes manipulations both more efficient and easier than list of pairs. Doing so, you may also completely bypass common prefix calculations, as it comes out by itself:

    import Data.List (tails)
    import Data.Maybe (maybeToList)
    import Control.Arrow (first, second)
    import Data.Map.Strict (Map, empty, insert, insertWith, assocs)
    
    data SuffixTree
      = Leaf Int
      | Node [(String, SuffixTree)]
      deriving Show
    
    data SuffixTrie
      = Leaf' Int
      | Node' (Map (Maybe Char) SuffixTrie)
    
    buildTrie :: String -> SuffixTrie
    buildTrie s = foldl go (flip const) (init $ tails s) (length s) $ Node' empty
      where
      go run xs i (Node' ns) = run (i - 1) $ Node' tr
        where tr = foldr loop (insert Nothing $ Leaf' (i - 1)) xs ns
      loop x run = insertWith (+:) (Just x) . Node' $ run empty
        where _ +: Node' ns = Node' $ run ns
    
    buildTree :: String -> SuffixTree
    buildTree = loop . buildTrie
      where
      loop (Leaf' i) = Leaf i
      loop (Node' m) = Node $ con . second loop <$> assocs m
      con (Just x, Node [(xs, tr)]) = (x:xs, tr) -- compress single-child nodes
      con n = maybeToList `first` n
    

    then:

    \> buildTree "banana"
    Node [("a",Node [("",Leaf 5),
                     ("na",Node [("",Leaf 3),
                                 ("na",Leaf 1)])]),
          ("banana",Leaf 0),
          ("na",Node [("",Leaf 4),
                      ("na",Leaf 2)])]
    

    similarly:

    \> buildTree "mississippi"
    Node [("i",Node [("",Leaf 10),
                     ("ppi",Leaf 7),
                     ("ssi",Node [("ppi",Leaf 4),
                                  ("ssippi",Leaf 1)])]),
          ("mississippi",Leaf 0),
          ("p",Node [("i",Leaf 9),
                     ("pi",Leaf 8)]),
          ("s",Node [("i",Node [("ppi",Leaf 6),
                                ("ssippi",Leaf 3)]),
                     ("si",Node [("ppi",Leaf 5),
                                 ("ssippi",Leaf 2)])])]