In Clojure, are lazy seqs always chunked?


I was under the impression that the lazy seqs were always chunked.

=> (take 1 (map #(do (print \.) %) (range)))
(................................0)

As expected, 32 dots are printed, because the lazy seq returned by range is chunked into 32-element chunks. However, when I try this with my own function get-rss-feeds instead of range, the lazy seq is no longer chunked:

=> (take 1 (map #(do (print \.) %) (get-rss-feeds r)))
(."http://wholehealthsource.blogspot.com/feeds/posts/default")

Only one dot is printed, so I guess the lazy-seq returned by get-rss-feeds is not chunked. Indeed:

=> (chunked-seq? (seq (range)))
true

=> (chunked-seq? (seq (get-rss-feeds r)))
false

Here is the source for get-rss-feeds:

(defn get-rss-feeds
  "returns a lazy seq of urls of all feeds; takes an html-resource from the enlive library"
  [hr]
  (map #(:href (:attrs %))
       (filter #(rss-feed? (:type (:attrs %))) (html/select hr [:link]))))

So it appears that chunkiness depends on how the lazy seq is produced. I peeked at the source for the function range and there are hints of it being implemented in a "chunky" manner. So I'm a bit confused as to how this works. Can someone please clarify?
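To make the dependence on the source concrete, here is a small sketch that can be pasted into a REPL. It only uses clojure.core, and the exact results may vary between Clojure versions. For instance, in recent versions a no-argument (range) is, as far as I know, built on iterate and is not chunked, which is why the bounded (range 100) is used here. The name checks is just an illustrative var.

```clojure
;; Sketch: which sources yield chunked seqs? (Results can vary by Clojure version.)
(def checks
  {:bounded-range (chunked-seq? (seq (range 100)))              ; range with an end is chunked
   :vector        (chunked-seq? (seq [1 2 3]))                  ; vectors produce chunked seqs
   :map-on-vector (chunked-seq? (seq (map inc [1 2 3])))        ; map passes its input's chunks along
   :list          (chunked-seq? (seq (list 1 2 3)))             ; lists are not chunked
   :iterate       (chunked-seq? (seq (iterate inc 0)))          ; iterate is not chunked
   :map-on-list   (chunked-seq? (seq (map inc (list 1 2 3))))}) ; nothing to pass along here
```

Functions such as map and filter do not introduce chunking on their own; they process a chunk at a time only when the seq they are given is already chunked.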


Here's why I need to know.

I have the following code: (get-rss-entry (get-rss-feeds h-res) url)

The call to get-rss-feeds returns a lazy sequence of URLs of feeds that I need to examine.

The call to get-rss-entry looks for a particular entry (whose :link field matches the second argument of get-rss-entry). It examines the lazy sequence returned by get-rss-feeds. Evaluating each item requires an HTTP request across the network to fetch a new RSS feed. To minimize the number of HTTP requests, it's important to examine the sequence one item at a time and stop as soon as there is a match.

Here is the code:

(defn get-rss-entry
  [feeds url]
  (ffirst (drop-while empty? (map #(entry-with-url % url) feeds))))

entry-with-url returns a lazy sequence of matches or an empty sequence if there is no match.
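One way to check how many feeds actually get examined is to swap the network for an in-memory stub and count the calls. The sketch below does that. The names fake-feeds, fetch-count and fetches-for, as well as the example.com URLs, are hypothetical and exist only for this illustration, and the entry-with-url shown here is a stand-in, not the real function.

```clojure
;; Hypothetical in-memory stand-in for the network: feed URL -> entries.
(def fake-feeds
  {"feed-a" [{:link "http://example.com/1"}]
   "feed-b" [{:link "http://example.com/2"}]
   "feed-c" [{:link "http://example.com/3"}]})

;; Counts how many feeds were "fetched".
(def fetch-count (atom 0))

;; Stub for entry-with-url; the real one would issue an HTTP request.
(defn entry-with-url [feed url]
  (swap! fetch-count inc)
  (filter #(= url (:link %)) (fake-feeds feed)))

;; Same definition as in the question.
(defn get-rss-entry [feeds url]
  (ffirst (drop-while empty? (map #(entry-with-url % url) feeds))))

(defn fetches-for
  "Resets the counter, runs the lookup, and returns [entry fetch-count]."
  [feeds url]
  (reset! fetch-count 0)
  (let [entry (get-rss-entry feeds url)]
    [entry @fetch-count]))

;; A list is not chunked, so the search stops at the first match (2 fetches).
(def list-result
  (fetches-for (list "feed-a" "feed-b" "feed-c") "http://example.com/2"))

;; A vector is chunked, so map realizes the whole short chunk (3 fetches).
(def vector-result
  (fetches-for ["feed-a" "feed-b" "feed-c"] "http://example.com/2"))
```

Both calls return the same entry; only the number of simulated fetches differs, which is the behavior that matters when each fetch is a real HTTP request.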

I tested this and it seems to work correctly (evaluating one feed url at a time). But I am worried that somewhere, somehow it will start behaving in a "chunky" way and it will start evaluating 32 feeds at a time. I know there is a way to avoid chunky behavior as discussed here, but it doesn't seem to even be required in this case.

Am I using lazy seq non-idiomatically? Would loop/recur be a better option?


Solution

  • Depending on the vagaries of chunking seems unwise, as you mention above. Explicitly "un-chunking" in cases where you really need the seq not to be chunked is also wise, because if your code later changes in a way that makes the seq chunked, things won't break. On another note, if you need actions to be sequential, agents are a great tool: you could send the download functions to an agent, and they will then run one at a time and only once, regardless of how you evaluate the function. At some point you may want to pmap your sequence, and then even un-chunking will not work, though using an agent will continue to work correctly.
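As a rough illustration of explicit un-chunking, here is a minimal sketch of a helper that is often called unchunk. It is not part of clojure.core, and the helper calls-to-realize-first is a hypothetical name used only to count how often the mapped function runs.

```clojure
;; A commonly used helper (not in clojure.core): rebuilds a seq one cons cell
;; at a time, so downstream map/filter never see a chunk.
(defn unchunk [s]
  (lazy-seq
    (when-let [s (seq s)]
      (cons (first s) (unchunk (rest s))))))

(defn calls-to-realize-first
  "Returns how many times the mapped function ran when only the
  first element of the mapped seq was requested."
  [coll]
  (let [calls (atom 0)]
    (first (map (fn [x] (swap! calls inc) x) coll))
    @calls))

;; Chunked source: the whole first chunk is realized (more than one call).
(def chunked-calls (calls-to-realize-first (range 100)))

;; Un-chunked source: only the requested element is realized (one call).
(def unchunked-calls (calls-to-realize-first (unchunk (range 100))))
```

Wrapping the source as (unchunk feeds) before mapping keeps one-at-a-time evaluation even if the source later becomes a vector or another chunked collection.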