Transposing / pivoting rows to columns in Cascalog?

Let's say I have a set of tuples to be processed by Cascalog, formatted like [Date, Name, Value], e.g.

2014-01-01 Pizza        3
2014-01-01 Hamburger    4
2014-01-01 Cheeseburger 2
2014-01-02 Pizza        1
2014-01-02 Hamburger    2

Given that I have a list of columns like [Pizza, Hamburger, Cheeseburger], I want to transpose / pivot the data so it looks like this:

Date       Pizza Hamburger Cheeseburger
2014-01-01 3     4         2
2014-01-02 1     2         0

What's the best way to do this in Cascalog?

Solution

Here's one way to do it:

(:use cascalog.api)

(def input
  [["2014-01-01" "Pizza"        3]
   ["2014-01-01" "Hamburger"    4]
   ["2014-01-01" "Cheeseburger" 2]
   ["2014-01-02" "Pizza"        1]
   ["2014-01-02" "Hamburger"    2]])

(defn init-aggregate [k v]
  {k v})

(def combine-aggregate
  (partial merge-with +))

(defparallelagg aggregate
  :init-var #'init-aggregate
  :combine-var #'combine-aggregate) 

(defn select-values [hashmap keyseq]
  (map #(get hashmap %) keyseq))

(def columns
  ["Pizza" "Hamburger" "Cheeseburger"])

(defn transpose [data]
  (<- [?date !pizza !hamburger !cheeseburger]
      ((<- [?date ?sum]
           (data ?date ?name ?value)
           (aggregate ?name ?value :> ?sum))
         ?date ?sum)
      (select-values ?sum columns :> !pizza !hamburger !cheeseburger)))

(?- (stdout) (transpose input))

Let's have a quick run through the code:

Most of the action happens in the transpose function, which contains two queries:

The inner query aggregates all ?name ?value pairs for a given date into a ?sum map.
The outer query uses select-values to fetch the values for our columns out of the ?sum map, and into the final result rows.

Since we know the columns are Pizza, Hamburger, Cheeseburger we can simply hardcode them into the query. If you want to know how to make the columns dynamic, read Nathan Marz's blog post on creating a news feed in Cascalog.

Note that we have to represent the columns as nullable variables (using !) since not every column will have a value for any given row. If we wanted to avoid null results, we could change select-values to use 0 as the default value.

(One caveat is that this won't produce any headers in the final output, so this has to be done as a post-processing step.)