Tags: csv, clojure, out-of-memory, large-files, compojure

How to stream a large CSV response from a compojure API so that the whole response is not held in memory at once?


I'm new to Compojure, but have been enjoying it so far. I'm currently encountering a problem in one of my API endpoints that generates a large CSV file from the database and then passes it as the response body.

The problem I seem to be encountering is that the whole CSV file is kept in memory, which then causes an out-of-memory error in the API. What is the best way to handle and generate this, ideally as a gzipped file? Is it possible to stream the response so that a few thousand rows are returned at a time? When I return a JSON response body for the same data, there is no problem.

Here is the current code I'm using to return this:

(defn complete
  "Returns metrics for each completed benchmark instance"
  [db-client response-format]
  (let [benchmarks  (completed-benchmark-metrics {} db-client)]
    (case response-format
      :json  (json-grouped-output field-mappings benchmarks)
      :csv   (csv-output benchmarks))))

(defn csv-output [data-seq]
  (let [header (map name (keys (first data-seq)))
        out    (java.io.StringWriter.)
        write  #(csv/write-csv out (list %))]
    (write header)
    (dorun (map (comp write vals) data-seq))
    (.toString out)))

The data-seq is the result set returned from the database, which I think is a lazy sequence. I'm using yesql to perform the database call.

Here is my compojure resource for this API endpoint:

(defresource results-complete [db]
  :available-media-types  ["application/json" "text/csv"]
  :allowed-methods        [:get]
  :handle-ok              (fn [request]
                            (let [response-format (keyword (get-in request [:request :params :format] :json))
                                  disposition     (str "attachment; filename=\"nucleotides_benchmark_metrics." (name response-format) "\"")
                                  response        {:headers {"Content-Type" (content-types response-format)
                                                             "Content-Disposition" disposition}
                                                   :body    (results/complete db response-format)}]
                              (ring-response response))))

Solution

  • Thanks to all the suggestions that were provided in this thread, I was able to create a solution using piped-input-stream:

    (defn csv-output [data-seq]
      (let [headers     (map name (keys (first data-seq)))
            rows        (map vals data-seq)
            stream-csv  (fn [out]
                          (csv/write-csv out (cons headers rows))
                          (.flush out))]
        (piped-input-stream #(stream-csv (io/make-writer % {})))))
    

    This differs from my original solution because it does not realise the whole sequence with dorun, nor does it build a large String object. Instead it writes to a PipedInputStream connection asynchronously, as described by the documentation:

    Create an input stream from a function that takes an output stream as its argument. The function will be executed in a separate thread. The stream will be automatically closed after the function finishes.
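Since the original question also asked for the file as gzipped output, the same piped-input-stream approach can be extended by wrapping the output stream in a java.util.zip.GZIPOutputStream before the CSV writer sees it. This is only a sketch: gzipped-csv-output is a hypothetical name, and it assumes ring.util.io/piped-input-stream and clojure.data.csv are on the classpath as in the solution above.

```clojure
(ns example.csv-stream
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]
            [ring.util.io :refer [piped-input-stream]])
  (:import [java.util.zip GZIPOutputStream]))

;; Hypothetical variant of csv-output: streams the CSV rows through a
;; GZIPOutputStream, so neither the full CSV nor its compressed form
;; is ever held in memory at once.
(defn gzipped-csv-output [data-seq]
  (let [headers (map name (keys (first data-seq)))
        rows    (map vals data-seq)]
    (piped-input-stream
      (fn [out]
        ;; with-open closes the gzip stream, which writes the gzip
        ;; trailer before piped-input-stream closes the pipe.
        (with-open [writer (io/writer (GZIPOutputStream. out))]
          (csv/write-csv writer (cons headers rows)))))))
```

The response map would then also need a "Content-Encoding" "gzip" header alongside Content-Type and Content-Disposition so that clients decompress the body transparently.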