Tags: clojure, heap-memory, inputstream

Huge file in Clojure and Java heap space error


I posted before about a huge XML file - it's a 287 GB Wikipedia XML dump that I want to turn into a CSV file (revision authors and timestamps). I managed to get that working up to a point. At first I got a StackOverflowError, but now, after solving that first problem, I get a java.lang.OutOfMemoryError: Java heap space error.

My code (partly taken from Justin Kramer's answer) looks like this:

(defn process-pages
  [page]
  (let [title     (article-title page)
        revisions (filter #(= :revision (:tag %)) (:content page))]
    (for [revision revisions]
      (let [user (revision-user revision)
            time (revision-timestamp revision)]
        (spit "files/data.csv"
              (str "\"" time "\";\"" user "\";\"" title "\"\n" )
              :append true)))))

(defn open-file
  [file-name]
  (let [rdr (BufferedReader. (FileReader. file-name))]
    (->> (:content (data.xml/parse rdr :coalescing false))
         (filter #(= :page (:tag %)))
         (map process-pages))))

I don't show the article-title, revision-user and revision-timestamp functions, because they simply take data from a specific place in the page or revision hash. Could anyone help me with this? I'm really new to Clojure and don't see what the problem is.


Solution

  • Just to be clear, (:content (data.xml/parse rdr :coalescing false)) IS lazy. Check its class or pull the first item (it will return instantly) if you're not convinced.
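
    For example, a quick REPL check along these lines (just a sketch, assuming rdr is the open reader from open-file above) confirms that nothing is read eagerly:

    (let [xml (data.xml/parse rdr :coalescing false)]
      (class (:content xml))   ;; => something like clojure.lang.LazySeq, not a fully built tree
      (first (:content xml)))  ;; returns almost immediately, without reading the whole 287 GB file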

    That said, a couple things to watch out for when processing large sequences: holding onto the head, and unrealized/nested laziness. I think your code suffers from the latter.
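
    To see why nested laziness bites, here is a toy sketch (unrelated to the Wikipedia data): side effects buried inside a lazy for that is itself produced by a lazy map never run unless both layers are forced.

    ;; Nothing is printed when this is defined: map is lazy.
    (def pending
      (map (fn [page]
             ;; for is also lazy, so these printlns sit inside an
             ;; unrealized inner sequence
             (for [rev (range 3)]
               (println "writing" page rev)))
           (range 3)))

    ;; Forcing only the outer sequence still prints nothing,
    ;; because the inner for-seqs are never realized:
    (dorun pending)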

    Here's what I recommend:

    1) Add (dorun) to the end of the ->> chain of calls. This will force the sequence to be fully realized without holding onto the head.

    2) Change the for in process-pages to doseq. You're spitting to a file, which is a side effect, and you don't want to do that lazily here. (Both changes are sketched together just below.)
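
    Put together, the two changes could look roughly like this (a sketch based on the code in the question, still using spit for now):

    (defn process-pages
      [page]
      (let [title     (article-title page)
            revisions (filter #(= :revision (:tag %)) (:content page))]
        ;; doseq is eager, which is what we want for the spit side effect
        (doseq [revision revisions]
          (let [user (revision-user revision)
                time (revision-timestamp revision)]
            (spit "files/data.csv"
                  (str "\"" time "\";\"" user "\";\"" title "\"\n")
                  :append true)))))

    (defn open-file
      [file-name]
      (let [rdr (BufferedReader. (FileReader. file-name))]
        (->> (:content (data.xml/parse rdr :coalescing false))
             (filter #(= :page (:tag %)))
             (map process-pages)
             ;; dorun walks the whole sequence for its side effects,
             ;; discarding results and not holding onto the head
             (dorun))))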

    As Arthur recommends, you may want to open an output file once and keep writing to it, rather than opening & writing (spit) for every Wikipedia entry.

    UPDATE:

    Here's a rewrite which attempts to separate concerns more clearly:

    (defn filter-tag [tag xml]
      (filter #(= tag (:tag %)) xml))
    
    ;; lazy
    (defn revision-seq [xml]
      (for [page (filter-tag :page (:content xml))
            :let [title (article-title page)]
            revision (filter-tag :revision (:content page))
            :let [user (revision-user revision)
                  time (revision-timestamp revision)]]
        [time user title]))
    
    ;; eager
    (defn transform [in out]
      (with-open [r (io/input-stream in)
                  w (io/writer out)]
        (binding [*out* w]
          (let [xml (data.xml/parse r :coalescing false)]
            (doseq [[time user title] (revision-seq xml)]
              (println (str "\"" time "\";\"" user "\";\"" title "\"")))))))
    
    (transform "dump.xml" "data.csv")
    

    I don't see anything here that would cause excessive memory use.
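
    For reference, the io/ and data.xml/ aliases in the rewrite assume requires roughly like these (the namespace name here is made up; clojure.data.xml comes from the org.clojure/data.xml dependency):

    (ns wiki.transform
      (:require [clojure.java.io :as io]
                [clojure.data.xml :as data.xml]))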