I am still attempting to create a detailed time-series dataframe. I'm attempting to get monthly data for multiple data points, then group by multiple factors. I'm not sure this is possible, as I have not seen an example close to this in the documentation, vignettes or on SO.
Here is the sample data I am trying to structure:
clients <- 1:100
dates <- seq(as.Date("2012/1/1"), as.Date("2012/9/1"), "days")
categories <- LETTERS[1:5]
products <- data.frame(clientID = sample(clients, 10000, replace = TRUE),
OrderDate = sample(dates, 10000, replace = TRUE),
category = sample(categories, 10000, replace = TRUE),
numProducts = sample(1:10, 10000, replace = TRUE),
OrderTotal = sample(1:100, 10000, replace = TRUE))
The output looks like this:
clientID OrderDate category numProducts OrderTotal
1 90 2012-03-20 D 9 18
2 66 2012-08-19 A 3 50
3 45 2012-05-25 A 10 75
4 28 2012-01-01 D 4 27
5 71 2012-02-28 A 4 76
6 26 2012-01-28 C 8 89
The structure I am trying to get to would look something like this:
Category A ... Category E
ClientID Jan2012numProducts Jan2012OrderTotal Feb2012numProducts Feb2012OrderTotal ... Sep2012numProducts Sep2012OrderTotal
1 12 78 6 52 0 0
2 7 218 3 15 1 28
99999 20 192 10 100 28 156
I realize that the column names will likely get long and would look something like AJan2012numProducts or AJan2012OrderTotal, and that's fine.
Here are the steps I'm unclear about. Again, I can't find them referenced in the documentation or the vignettes:
1) Can zoo aggregate multiple observation fields at once? In this case, I want the monthly sum of numProducts and OrderTotal at the same time. Even if zoo can't, I could use the merge function and join on clientID and category.
2) Can zoo group by a factor (or multiple factors) when aggregating? I want to be able to look at clientID and category by month.
3) Is it possible to build the dataframe with category and month along the X axis? If not, and I could get the time-series data to simply group by clientID and category, I could then use reshape to make the time series wide using cast. I would need to get the dataframe into this structure:
clientID Month category numProducts OrderTotal
1 2012-01-31 A 12 78
1 2012-01-31 B 0 0
99999 2012-09-30 D 6 71
99999 2012-09-30 E 1 28
cast(df, month~category, sum) (or something close to that)
Is any of this possible? Could you help with some examples?
A combination of format.Date, xtabs, and ftable gets you pretty much exactly what you ask for. I shortened the example a bit, but the principle should be clear. If you wanted the month field to be shorter, you could change the name of that dimension in the table object, or you could make a month column and redo all the work with that. (I admit I had trouble figuring out how 'zoo' would enter this picture. It looks like a simple tabulation problem at the moment. That said, I'm sure aggregate.zoo is capable of aggregating on multiple criteria and using sum as the aggregation function.)
First the two commands, then a console session output:
prodtbl <- xtabs(cbind(numProducts, OrderTotal) ~ clientID +
                   format(OrderDate, "%b%Y") + category,
                 data = products)
ftable(prodtbl, row.vars = c("category", "clientID"))
Now the output:
> xtabs(cbind(numProducts, OrderTotal) ~ clientID + format(OrderDate, "%b%Y")+category, data=products)
, , category = A, = numProducts
format(OrderDate, "%b%Y")
clientID Feb2012 Jan2012 Mar2012
1 23 0 16
2 0 6 27
3 30 0 21
4 13 33 24
5 5 20 12
, , category = B, = numProducts
format(OrderDate, "%b%Y")
clientID Feb2012 Jan2012 Mar2012
1 8 27 23
2 8 14 4
3 0 5 6
4 8 13 39
5 3 23 9
, , category = C, = numProducts
format(OrderDate, "%b%Y")
clientID Feb2012 Jan2012 Mar2012
1 0 6 20
2 20 20 4
3 0 17 0
4 17 11 2
5 7 3 8
, , category = A, = OrderTotal
format(OrderDate, "%b%Y")
clientID Feb2012 Jan2012 Mar2012
1 40 0 41
2 0 5 33
3 48 0 40
4 16 28 24
5 23 42 29
, , category = B, = OrderTotal
format(OrderDate, "%b%Y")
clientID Feb2012 Jan2012 Mar2012
1 14 24 19
2 22 19 19
3 0 2 4
4 19 46 62
5 10 38 10
, , category = C, = OrderTotal
format(OrderDate, "%b%Y")
clientID Feb2012 Jan2012 Mar2012
1 0 2 39
2 30 33 7
3 0 44 0
4 50 21 19
5 16 14 28
# You could have skipped the printout by assigning to 'prodtbl' in the step above.
# I thought it was useful pedagogically.
> prodtbl <- .Last.value
> ftable(prodtbl, row.vars=c("category","clientID"))
format(OrderDate, "%b%Y") Feb2012 Jan2012 Mar2012
numProducts OrderTotal numProducts OrderTotal numProducts OrderTotal
category clientID
A 1 23 40 0 0 16 41
2 0 0 6 5 27 33
3 30 48 0 0 21 40
4 13 16 33 28 24 24
5 5 23 20 42 12 29
B 1 8 14 27 24 23 19
2 8 22 14 19 4 19
3 0 0 5 2 6 4
4 8 19 13 46 39 62
5 3 10 23 38 9 10
C 1 0 0 6 2 20 39
2 20 30 20 33 4 7
3 0 0 17 44 0 0
4 17 50 11 21 2 19
5 7 16 3 14 8 28
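Note that the month columns above come out in alphabetical order (Feb2012, Jan2012, Mar2012), because format() returns plain character strings. Here is a minimal sketch of the "make a month column" idea mentioned earlier, assuming you want calendar order and a shorter dimension name. The column name Month and the helper vector month_levels are my own choices, and the seed is arbitrary:

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
clients    <- 1:5
dates      <- seq(as.Date("2012/1/1"), as.Date("2012/3/31"), "days")
categories <- LETTERS[1:3]
products <- data.frame(clientID    = sample(clients, 100, replace = TRUE),
                       OrderDate   = sample(dates, 100, replace = TRUE),
                       category    = sample(categories, 100, replace = TRUE),
                       numProducts = sample(1:10, 100, replace = TRUE),
                       OrderTotal  = sample(1:20, 100, replace = TRUE))

# Month labels in calendar order, built from the first day of each month
month_levels <- format(seq(as.Date("2012-01-01"), as.Date("2012-03-01"),
                           by = "month"), "%b%Y")

# A factor with explicit levels keeps xtabs from sorting the months alphabetically
products$Month <- factor(format(products$OrderDate, "%b%Y"),
                         levels = month_levels)

prodtbl <- xtabs(cbind(numProducts, OrderTotal) ~ clientID + Month + category,
                 data = products)
ftable(prodtbl, row.vars = c("category", "clientID"))
```

The column header then reads Month instead of format(OrderDate, "%b%Y"), and the months follow the order of the factor levels.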
This is the shortened example:
clients <- 1:5
dates <- seq(as.Date("2012/1/1"), as.Date("2012/3/31"), "days")
categories <- LETTERS[1:3]
products <- data.frame(clientID = sample(clients, 100, replace = TRUE),
OrderDate = sample(dates, 100, replace = TRUE),
category = sample(categories, 100, replace = TRUE),
numProducts = sample(1:10, 100, replace = TRUE),
OrderTotal = sample(1:20, 100, replace = TRUE))
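If you would rather end up with a regular data frame than a table object, something along these lines might work. This is only a sketch using base aggregate() and reshape() (not zoo, and not the reshape package's cast), and the names Month, key, monthly and wide are my own. It also assumes a "%Y-%m" month label instead of the month-end dates shown in the question, and combinations with no orders are simply absent from the long result rather than listed as 0:

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
clients    <- 1:5
dates      <- seq(as.Date("2012/1/1"), as.Date("2012/3/31"), "days")
categories <- LETTERS[1:3]
products <- data.frame(clientID    = sample(clients, 100, replace = TRUE),
                       OrderDate   = sample(dates, 100, replace = TRUE),
                       category    = sample(categories, 100, replace = TRUE),
                       numProducts = sample(1:10, 100, replace = TRUE),
                       OrderTotal  = sample(1:20, 100, replace = TRUE))

# Month label such as "2012-01"; this form sorts chronologically as text
products$Month <- format(products$OrderDate, "%Y-%m")

# Long format: one row per observed clientID / Month / category combination,
# with both value columns summed in a single call
monthly <- aggregate(cbind(numProducts, OrderTotal) ~ clientID + Month + category,
                     data = products, FUN = sum)
head(monthly)

# Wide format: one row per client, columns like numProducts.A2012-01
monthly$key <- paste0(monthly$category, monthly$Month)
wide <- reshape(monthly[, c("clientID", "key", "numProducts", "OrderTotal")],
                idvar = "clientID", timevar = "key", direction = "wide")
wide[is.na(wide)] <- 0  # combinations with no orders become 0
```

The wide columns appear in the order the category/month keys are first encountered, so you may want to reorder them by name afterwards.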