Search code examples
xmlclojuretreetruncatezipper

Clojure XML zipper walk and prune


I am walking a html/xml data structure. I walk through it using clojure.zip. Once I find a node at which I want to cut (prune), I cannot find a way to remove all children and right nodes.

Example:

Let's say I have this tree (that represents html):

(def tree [:p "F"
           [:p "G" [:p "I" [:p "H"]]]
           [:p "B"
            [:p
             "D"
             [:p "E"]
             [:p "C"]]
            [:p "A"]]])

I parse it, xml-zip it, and at some point while walking I end up in node "D", where I want to cut. I now need to return the root without "E", "C" (children) and "D". Those are all the nodes that are not visited already when using next at this point.

How would I remove these nodes?

Note: if this is not feasible, I would also welcome an approach that copies the zipper up to the cut point.

Example data: This is the parsed data that I have for the above tree, on which I call xml-zip:

{:tag :html, :attrs nil, :content [{:tag :head, :attrs nil, :content nil} {:tag :body, :attrs nil, :content [{:tag :p, :attrs nil, :content ["F"]} {:tag :p, :attrs nil, :content ["G"]} {:tag :p, :attrs nil, :content ["I"]} {:tag :p, :attrs nil, :content ["H"]} {:tag :p, :attrs nil, :content nil} {:tag :p, :attrs nil, :content nil} {:tag :p, :attrs nil, :content ["B"]} {:tag :p, :attrs nil, :content ["D"]} {:tag :p, :attrs nil, :content ["E"]} {:tag :p, :attrs nil, :content ["C"]} {:tag :p, :attrs nil, :content nil} {:tag :p, :attrs nil, :content ["A"]} {:tag :p, :attrs nil, :content nil} {:tag :p, :attrs nil, :content nil}]}]}

And I start walking through it like so to get the content:

(-> parsed (z/xml-zip)
           (z/down) ;head
           (z/right) ; body
           (z/down) ; content
           )

Another example:

the following string: "<article><h1><img href=\"some-url\"></img> some-text <b>in bold</b></h1><ul><li> AA </li> <li>BB</li></ul></article>" will give me the following map:

[{:tag :html, :attrs nil, :content [{:tag :head, :attrs nil, :content nil} {:tag :body, :attrs nil, :content [{:tag :article, :attrs nil, :content [{:tag :h1, :attrs nil, :content [{:tag :img, :attrs {:href "some-url"}, :content nil} " some-text " {:tag :b, :attrs nil, :content ["in bold"]}]} {:tag :ul, :attrs nil, :content [{:tag :li, :attrs nil, :content [" AA "]} " " {:tag :li, :attrs nil, :content ["BB"]}]}]}]}]} nil]

when cutting at "some-text" it should ultimately result in the string <article><h1><img href=\"some-url\"></img> some-text</h1></article>


Solution

  • Firs of all, i would rephrase your task the following way:

    The goal is to find some node, and then remove it and everything to the right of it from it's parent.

    Stated this way, the cut function can be easily implemented with the help of clojure.zip/edit for parent:

    (defn cut [loc]
      (when-let [parent (z/up loc)]
        (z/edit parent #(z/make-node loc % (z/lefts loc)))))
    

    so, as it was said above, we edit the parent of the loc, making the new node of it, keeping only the children to the left of the loc.

    notice, that there is the when-let macro there, to avoid the null pointer exception if the passed location doesn't have parent (meaning it is the root of the zipper)

    Now the test:

    let's try to delete the p containing ["I"]:

    user> (-> html
              z/xml-zip
              z/down
              z/right
              z/down
              z/right
              z/right
              z/node)
    ;; {:tag :p, :attrs nil, :content ["I"]}
    
    user> (-> html
              z/xml-zip
              z/down
              z/right
              z/down
              z/right
              z/right
              cut
              z/root)
    ;;{:tag :html, :attrs nil, 
    ;; :content [{:tag :head, :attrs nil, :content nil} 
    ;;           {:tag :body, :attrs nil, 
    ;;            :content [{:tag :p, :attrs nil, :content ["F"]} 
    ;;                      {:tag :p, :attrs nil, :content ["G"]}]}]}
    

    as expected: everything to the right from (and including) I was removed from the body.

    update

    according to the update, you want to remove all the nodes in the tree after the target one. This is a little bit more tricky, since it requires changing all the node's parents up to the root. In this case the cut function could look like this:

    (defn cut [loc]
      (loop [loc loc]
        (if-let [parent (z/up loc)]
          (recur
           (z/replace parent
                      (z/make-node loc
                                   (z/node parent)
                                   (drop-last (count (z/rights loc))
                                              (z/children parent)))))
          (z/node loc))))
    

    test:

    user> (-> h2 
              z/xml-zip 
              z/down 
              z/right 
              z/down 
              z/down 
              z/down 
              z/right 
              cut)
    
    ;;{:tag :html, :attrs nil, 
    ;; :content [{:tag :head, :attrs nil, :content nil} 
    ;;           {:tag :body, :attrs nil, 
    ;;            :content [{:tag :article, :attrs nil, 
    ;;                       :content [{:tag :h1, :attrs nil, 
    ;;                                  :content [{:tag :img, :attrs {:href "some-url"}, :content nil} " some-text "]}]}]}]}