Tags: clojure, lazy-evaluation, stream-processing

Lazily extract lines from large file


I'm trying to grab 5 lines by their line numbers from a large (> 1GB) file with Clojure. I'm almost there but am seeing some strange things, and I want to understand what's going on.

So far I've got:

(defn multi-nth [values indices]
  (map (partial nth values) indices))

(defn read-lines [file indices]
  (with-open [rdr (clojure.java.io/reader file)]
    (let [lines (line-seq rdr)]
      (multi-nth lines indices))))

Now, (read-lines "my-file" [0]) works without a problem. However, passing in [0 1] gives me the following stacktrace:

java.lang.RuntimeException: java.io.IOException: Stream closed
        Util.java:165 clojure.lang.Util.runtimeException
      LazySeq.java:51 clojure.lang.LazySeq.sval
      LazySeq.java:60 clojure.lang.LazySeq.seq
         Cons.java:39 clojure.lang.Cons.next
          RT.java:769 clojure.lang.RT.nthFrom
          RT.java:742 clojure.lang.RT.nth
         core.clj:832 clojure.core/nth
         AFn.java:163 clojure.lang.AFn.applyToHelper
         AFn.java:151 clojure.lang.AFn.applyTo
         core.clj:602 clojure.core/apply
        core.clj:2341 clojure.core/partial[fn]
      RestFn.java:408 clojure.lang.RestFn.invoke
        core.clj:2430 clojure.core/map[fn]

It seems that the stream is being closed before I can read the second line from the file. Interestingly, if I manually pull out a line from the file with something like (nth lines 200), the multi-nth call works for all values <= 200.

Any idea what's going on?


Solution

  • map and line-seq both return lazy sequences, so the lines have not necessarily been read by the time with-open returns, and with-open closes the file on its way out.

    Basically, you need to realize the whole return value before with-open returns, for which you can use doall:

    (defn multi-nth [values indices]
      (map (partial nth values) indices))
    
    (defn read-lines [file indices]
      (with-open [rdr (clojure.java.io/reader file)]
        (let [lines (line-seq rdr)]
          (doall (multi-nth lines indices)))))
    

    Or something like that. Keep in mind that your multi-nth holds on to the head of the line seq while it searches for the specified lines, so every line up to the last requested one stays in memory. Using nth like that also means you walk the line seq from the start once per index. You'll want to fix both.
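
    To see the laziness problem in isolation, here is a minimal sketch that reads from an in-memory string instead of a file. The helper names (open-reader, lazy-lines, eager-lines) are made up for this illustration and are not part of the original code.

    ```clojure
    (import '(java.io BufferedReader StringReader))

    ;; Hypothetical helper: a BufferedReader over an in-memory string,
    ;; standing in for (clojure.java.io/reader file).
    (defn open-reader [text]
      (BufferedReader. (StringReader. text)))

    ;; Returns the line seq without realizing it. Only the first line has
    ;; been read when with-open closes the reader.
    (defn lazy-lines [text]
      (with-open [rdr (open-reader text)]
        (line-seq rdr)))

    ;; Forces every line while the reader is still open.
    (defn eager-lines [text]
      (with-open [rdr (open-reader text)]
        (doall (line-seq rdr))))

    (first (lazy-lines "a\nb\nc"))
    ;; => "a"

    ;; Asking for the second line touches the closed reader and throws:
    ;; (second (lazy-lines "a\nb\nc"))

    (eager-lines "a\nb\nc")
    ;; => ("a" "b" "c")
    ```

    This also accounts for the two observations in the question. line-seq reads the first line as soon as it is called, so index 0 is already available after the reader is closed. And a lazy seq caches whatever has been realized, so forcing (nth lines 200) inside with-open leaves lines 0 through 200 readable afterwards.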

    Update:

    Something like this will work. It's a little uglier than I'd like, but I think it shows the principle. Note that indices here needs to be a set.

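    (defn multi-nth [values indices]
      (keep
        ;; Each item is an [index line] pair; keep the line only if its
        ;; index is in the set, otherwise return nil so keep drops it.
        (fn [[number line]]
          (when (contains? indices number)
            line))
        (map-indexed vector values)))

    (multi-nth '(a b c d e) #{2 3})
    ;; => (c d)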
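
    Putting the two ideas together (a set lookup in a single pass, plus doall inside with-open), a complete read-lines might look like the sketch below. It is one possible arrangement, not the only one. Two things here are my own additions rather than part of the code above: it uses keep-indexed in place of the map-indexed/keep pair, and it stops reading after the largest requested index with take, so the rest of the file is never touched.

    ```clojure
    (require '[clojure.java.io :as io])

    (defn read-lines
      "Returns the lines of file at the given zero-based indices, in file order."
      [file indices]
      (let [wanted (set indices)]
        (if (empty? wanted)
          ()
          (with-open [rdr (io/reader file)]
            (let [last-index (apply max wanted)]
              ;; doall realizes the result before with-open closes the reader.
              (doall
                (keep-indexed
                  (fn [n line] (when (contains? wanted n) line))
                  ;; Stop reading once the last requested line has been seen.
                  (take (inc last-index) (line-seq rdr)))))))))
    ```

    Because the indices are turned into a set, the result comes back in file order rather than in the order the indices were passed, and duplicate indices collapse into one. If you need the caller's order, you would have to reorder afterwards.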