Let's say I have a set of tuples to be processed by Cascalog, formatted like [Date, Name, Value]
, e.g.
2014-01-01 Pizza 3
2014-01-01 Hamburger 4
2014-01-01 Cheeseburger 2
2014-01-02 Pizza 1
2014-01-02 Hamburger 2
Given that I have a list of columns like [Pizza, Hamburger, Cheeseburger]
, I want to transpose / pivot the data so it looks like this:
Date Pizza Hamburger Cheeseburger
2014-01-01 3 4 2
2014-01-02 1 2 0
What's the best way to do this in Cascalog?
Here's one way to do it:
(:use cascalog.api)
(def input
[["2014-01-01" "Pizza" 3]
["2014-01-01" "Hamburger" 4]
["2014-01-01" "Cheeseburger" 2]
["2014-01-02" "Pizza" 1]
["2014-01-02" "Hamburger" 2]])
(defn init-aggregate [k v]
{k v})
(def combine-aggregate
(partial merge-with +))
(defparallelagg aggregate
:init-var #'init-aggregate
:combine-var #'combine-aggregate)
(defn select-values [hashmap keyseq]
(map #(get hashmap %) keyseq))
(def columns
["Pizza" "Hamburger" "Cheeseburger"])
(defn transpose [data]
(<- [?date !pizza !hamburger !cheeseburger]
((<- [?date ?sum]
(data ?date ?name ?value)
(aggregate ?name ?value :> ?sum))
?date ?sum)
(select-values ?sum columns :> !pizza !hamburger !cheeseburger)))
(?- (stdout) (transpose input))
Let's have a quick run through the code:
Most of the action happens in the transpose
function, which contains two queries:
The inner query aggregates all ?name ?value
pairs for a given date into a ?sum
map.
The outer query uses select-values
to fetch the values for our columns out of the ?sum
map, and into the final result rows.
Since we know the columns are Pizza, Hamburger, Cheeseburger
we can simply hardcode them into the query. If you want to know how to make the columns dynamic, read Nathan Marz's blog post on creating a news feed in Cascalog.
Note that we have to represent the columns as nullable variables (using !
) since not every column will have a value for any given row. If we wanted to avoid null
results, we could change select-values
to use 0 as the default value.
(One caveat is that this won't produce any headers in the final output, so this has to be done as a post-processing step.)