I'm trying to grab 5 lines by their line numbers from a large (> 1GB) file with Clojure. I'm almost there but am seeing some strange things, and I want to understand what's going on.
So far I've got:
(defn multi-nth [values indices]
  (map (partial nth values) indices))

(defn read-lines [file indices]
  (with-open [rdr (clojure.java.io/reader file)]
    (let [lines (line-seq rdr)]
      (multi-nth lines indices))))
Now, (read-lines "my-file" [0]) works without a problem. However, passing in [0 1] gives me the following stack trace:
java.lang.RuntimeException: java.io.IOException: Stream closed
Util.java:165 clojure.lang.Util.runtimeException
LazySeq.java:51 clojure.lang.LazySeq.sval
LazySeq.java:60 clojure.lang.LazySeq.seq
Cons.java:39 clojure.lang.Cons.next
RT.java:769 clojure.lang.RT.nthFrom
RT.java:742 clojure.lang.RT.nth
core.clj:832 clojure.core/nth
AFn.java:163 clojure.lang.AFn.applyToHelper
AFn.java:151 clojure.lang.AFn.applyTo
core.clj:602 clojure.core/apply
core.clj:2341 clojure.core/partial[fn]
RestFn.java:408 clojure.lang.RestFn.invoke
core.clj:2430 clojure.core/map[fn]
It seems that the stream is being closed before I can read the second line from the file. Interestingly, if I manually pull out a line from the file with something like (nth lines 200), the multi-nth call works for all values <= 200.
Any idea what's going on?
map and line-seq both return lazy sequences, so the lines are not necessarily read by the time with-open returns and closes the file.
Basically, you need to realize the whole return value before with-open returns, for which you can use doall:
(defn multi-nth [values indices]
  (map (partial nth values) indices))

(defn read-lines [file indices]
  (with-open [rdr (clojure.java.io/reader file)]
    (let [lines (line-seq rdr)]
      (doall (multi-nth lines indices)))))
Or something like that. Keep in mind that your multi-nth holds on to the head of the line seq while searching for the specified lines, so every line up to the last requested one stays in memory. Using nth like that also means you step through the line-seq from the start for each index. You'll want to fix both.
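To see the effect in isolation, here is a minimal sketch that reads from an in-memory java.io.StringReader instead of a file. The names lazy-first-two and eager-first-two are made up for illustration. It also suggests why index 0 happened to work in the question: line-seq reads the first line as soon as it is called, and only the rest of the sequence is lazy.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: returns a lazy seq that has NOT been realized
;; by the time with-open closes the reader.
(defn lazy-first-two [source]
  (with-open [rdr (io/reader source)]
    (map identity (take 2 (line-seq rdr)))))

;; Hypothetical helper: doall forces both lines while the reader is still open.
(defn eager-first-two [source]
  (with-open [rdr (io/reader source)]
    (doall (take 2 (line-seq rdr)))))

;; Realizing the eager version after the reader is closed is fine.
(println (eager-first-two (java.io.StringReader. "a\nb\nc")))

;; Realizing the lazy version needs the second line from a closed reader.
(try
  (println (doall (lazy-first-two (java.io.StringReader. "a\nb\nc"))))
  (catch Exception e
    (println "failed:" (.getMessage e))))
```

The first line is already sitting in the sequence when the reader closes, so only the request for the second line hits the closed stream. The same reasoning would explain the (nth lines 200) observation: once those lines have been realized inside with-open, they are cached in the sequence.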
Update:
Something like this will work. It's a little uglier than I'd like, but I think it shows the principle. Note that indices needs to be a set here.
(defn multi-nth [values indices]
  (keep
    (fn [[number line]]
      (if (contains? indices number)
        line))
    (map-indexed vector values)))
(multi-nth '(a b c d e) #{2 3})
=> (c d)
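One possible refinement, offered as a sketch rather than the definitive version: the function above keeps walking the sequence after the last wanted line, which matters for a file over 1GB. The variant below stops after the largest requested index by wrapping the input in take, and uses keep-indexed in place of the map-indexed and keep pair. The limit binding and the (set indices) conversion in read-lines are my own additions. Results come back in file order, and keep-indexed would drop nil items, which lines from line-seq never are.

```clojure
(require '[clojure.java.io :as io])

(defn multi-nth
  "Returns the items of values whose zero-based position is in the set
  indices, walking values once and stopping after the largest index."
  [values indices]
  (if (empty? indices)
    ()
    (let [limit (inc (apply max indices))] ; number of items worth looking at
      (keep-indexed (fn [i v] (when (contains? indices i) v))
                    (take limit values)))))

(defn read-lines [file indices]
  (with-open [rdr (io/reader file)]
    ;; doall realizes the result before with-open closes the reader
    (doall (multi-nth (line-seq rdr) (set indices)))))
```

Because nothing holds a reference to the start of the line seq while doall walks it, lines that are not selected can be garbage collected as the read progresses.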