Search code examples
clojuretree-traversallazy-sequenceszipper

Printing a tree lazily in Newick format


I wish to print a binary tree in Newick format, showing each node's distance to its parent. At the moment I haven't had an issue with the following code, which uses regular recursion, but a tree too deep may produce a stack overflow.

(defn tree->newick
  [tree]
  (let [{:keys [id children to-parent]} tree
        dist (double to-parent)] ; to-parent may be a rational
    (if children
      (str "(" (tree->newick (first children)) 
           "," (tree->newick (second children)) 
           "):" dist)
      (str (name id) ":" dist))))

(def example {:id nil :to-parent 0.0 
              :children [{:id nil :to-parent 0.5 
                          :children [{:id "A" :to-parent 0.3 :children nil}
                                     {:id "B" :to-parent 0.2 :children nil}]}
                         {:id "C" :to-parent 0.8 :children nil}]})

(tree->newick example)
;=> "((A:0.3,B:0.2):0.5,C:0.8):0.0"

(def linear-tree (->> {:id "bottom" :to-parent 0.1 :children nil}
                   (iterate #(hash-map :id nil :to-parent 0.1 
                                       :children [% {:id "side" :to-parent 0.1 :children nil}]))
                   (take 10000)
                   last))

(tree->newick linear-tree)
;=> StackOverflowError

The problem I've found with current utilities, such as tree-seq and clojure.walk, is that I have to visit an inner node more than once, to interpose the comma and close the bracket. I've used clojure.zip, but didn't manage to write a lazy/tail-recursive implementation, as I would need to store for each inner node how many times they were already visited.


Solution

  • Here's a version that works on your linear-tree example. It's a direct conversion of your implementation with two changes: it uses continuation passing style and the trampoline.

    (defn tree->newick
      ([tree]
         (trampoline tree->newick tree identity))
      ([tree cont]
         (let [{:keys [id children to-parent]} tree
               dist (double to-parent)]     ; to-parent may be a rational
           (if children
             (fn []
               (tree->newick
                (first children)
                (fn [s1] (fn []
                           (tree->newick
                            (second children)
                            (fn [s2] (cont (str "(" s1 "," s2 "):" dist))))))))
             (cont (str (name id) ":" dist))))))
    

    Edit: added pattern matching to allow calling the function in a simple way.

    Edit 2: I noticed that I made mistake. The problem is that I did take the fact that Clojure doesn't optimize tail calls only partially into account.

    The core idea of my solution is the transformation into continuation passing style so the recursive calls can be moved into tail position (i.e. instead of returning their result, the recursive calls pass it to the continuation as argument).

    Then I hand-optimized the recursive calls by making them use the trampoline. What I forgot to consider is that the calls of the continuations - which are not recursive calls but also in tail position - need to be optimized, too, because the tail calls can be a very long chain of closures, so that when the function finally evaluates them, it becomes a long chain of calls.

    This problem did not materialize with the the test data linear-tree because the continuation for the first child returns to the trampoline to process recursive call for the second child. But if linear-tree is changed so that it uses the second child of every node to build the linear tree instead of the first child, this does again cause a stack overflow.

    So the calls of the continuations need to return to the trampoline, too. (Actually, the call in the no children base case does not, because it will happen at most once before returning to trampoline, and the same will then be true for the second recursive call.) So here's an implementation that does take this into consideration and should use only constant stack space on all inputs:

    (defn tree->newick
      ([tree]
         (trampoline tree->newick tree identity))
      ([tree cont]
         (let [{:keys [id children to-parent]} tree
               dist (double to-parent)]     ; to-parent may be a rational
           (if children
             (fn [] (tree->newick
                     (first children)
                     (fn [s1] (tree->newick
                               (second children)
                               (fn [s2] #(cont (str "(" s1 "," s2 "):" dist)))))))
             (cont (str (name id) ":" dist))))))