
R longitudinal data - Grouping by multiple factors


I am still attempting to create a detailed time-series dataframe. I'm attempting to get monthly data for multiple data points, then group by multiple factors. I'm not sure this is possible, as I have not seen an example close to this in the documentation, vignettes or on SO.

Here is the sample data I am trying to structure:

clients <- 1:100
dates <- seq(as.Date("2012/1/1"), as.Date("2012/9/1"), "days")
categories <- LETTERS[1:5]
products <- data.frame(clientID = sample(clients, 10000, replace = TRUE), 
                       OrderDate = sample(dates, 10000, replace = TRUE), 
                       category = sample(categories, 10000, replace = TRUE),
                       numProducts = sample(1:10, 10000, replace = TRUE), 
                       OrderTotal = sample(1:100, 10000, replace = TRUE))

The output looks like this:

head(products)
  clientID  OrderDate category numProducts OrderTotal
1       90 2012-03-20        D           9         18
2       66 2012-08-19        A           3         50
3       45 2012-05-25        A          10         75
4       28 2012-01-01        D           4         27
5       71 2012-02-28        A           4         76
6       26 2012-01-28        C           8         89

The structure I am trying to get to would look something like this:

          Category A                                                                    ...   Category E
ClientID  Jan2012numProducts  Jan2012OrderTotal  Feb2012numProducts  Feb2012OrderTotal  ...  Sep2012numProducts  Sep2012OrderTotal
1         12                  78                 6                   52                      0                   0
2         7                   218                3                   15                      1                   28
...
99999     20                  192                10                  100                     28                  156

I realize that the column names will likely get long and would look something like AJan2012numProducts or AJan2012OrderTotal, and that's fine.

Here are the steps I'm unclear about. Again, I can't find them covered in the documentation or the vignettes:

1) Can zoo aggregate multiple observation fields at once? In this case, I want the monthly sums of numProducts and OrderTotal at the same time. Even if zoo can't, I could aggregate each field separately, then use the merge function and join on clientID and category.

2) Can zoo group by a factor (or multiple factors) to perform the aggregation? I want to be able to look at clientID and category by month.

3) Is it possible to build the dataframe with category and month along the X axis? If not, and I could get the time-series data to simply group by clientID and category, I could then use cast from the reshape package to make the time-series wide. I would need to get the dataframe into this structure:

head(df)
clientID   Month     category    numProducts  OrderTotal
1        2012-01-31  A           12           78
1        2012-01-31  B           0            0
....
99999    2012-09-30  D           6            71
99999    2012-09-30  E           1            28



cast(df, month~category, sum) (or something close to that)

Is any of this possible? Could you help with some examples?


Solution

  • A combination of format.Date, xtabs, and ftable gets you pretty much exactly what you ask for. I shortened the example a bit, but the principle should be clear. If you want a shorter name for the month field, you can rename that dimension in the table object, or you can create a month column first and build the table from that. (I admit I had trouble figuring out how 'zoo' would enter this picture. At the moment it looks like a simple tabulation problem. That said, I'm sure aggregate.zoo is capable of aggregating on multiple criteria with sum as the aggregation function.)

    First the two commands, then a console session output:

    prodtbl <- xtabs(cbind(numProducts, OrderTotal) ~ clientID + 
                                                     format(OrderDate, "%b%Y") + 
                                                     category, 
                     data=products)
    ftable(prodtbl, row.vars=c("category","clientID"))
    

    Now the output:

    > xtabs(cbind(numProducts, OrderTotal) ~ clientID + format(OrderDate, "%b%Y")+category, data=products)
    , , category = A,  = numProducts
    
            format(OrderDate, "%b%Y")
    clientID Feb2012 Jan2012 Mar2012
           1      23       0      16
           2       0       6      27
           3      30       0      21
           4      13      33      24
           5       5      20      12
    
    , , category = B,  = numProducts
    
            format(OrderDate, "%b%Y")
    clientID Feb2012 Jan2012 Mar2012
           1       8      27      23
           2       8      14       4
           3       0       5       6
           4       8      13      39
           5       3      23       9
    
    , , category = C,  = numProducts
    
            format(OrderDate, "%b%Y")
    clientID Feb2012 Jan2012 Mar2012
           1       0       6      20
           2      20      20       4
           3       0      17       0
           4      17      11       2
           5       7       3       8
    
    , , category = A,  = OrderTotal
    
            format(OrderDate, "%b%Y")
    clientID Feb2012 Jan2012 Mar2012
           1      40       0      41
           2       0       5      33
           3      48       0      40
           4      16      28      24
           5      23      42      29
    
    , , category = B,  = OrderTotal
    
            format(OrderDate, "%b%Y")
    clientID Feb2012 Jan2012 Mar2012
           1      14      24      19
           2      22      19      19
           3       0       2       4
           4      19      46      62
           5      10      38      10
    
    , , category = C,  = OrderTotal
    
            format(OrderDate, "%b%Y")
    clientID Feb2012 Jan2012 Mar2012
           1       0       2      39
           2      30      33       7
           3       0      44       0
           4      50      21      19
           5      16      14      28
    # You could have skipped the printout by assigning to 'prodtbl' in the step above.
    # I thought it was useful pedagogically.
    
    > prodtbl <- .Last.value
    
    > ftable(prodtbl, row.vars=c("category","clientID"))
                      format(OrderDate, "%b%Y")     Feb2012                Jan2012                Mar2012           
                                                numProducts OrderTotal numProducts OrderTotal numProducts OrderTotal
    category clientID                                                                                               
    A        1                                           23         40           0          0          16         41
             2                                            0          0           6          5          27         33
             3                                           30         48           0          0          21         40
             4                                           13         16          33         28          24         24
             5                                            5         23          20         42          12         29
    B        1                                            8         14          27         24          23         19
             2                                            8         22          14         19           4         19
             3                                            0          0           5          2           6          4
             4                                            8         19          13         46          39         62
             5                                            3         10          23         38           9         10
    C        1                                            0          0           6          2          20         39
             2                                           20         30          20         33           4          7
             3                                            0          0          17         44           0          0
             4                                           17         50          11         21           2         19
             5                                            7         16           3         14           8         28
    

    This is the shortened example data used for the output above:

    clients <- 1:5
    dates <- seq(as.Date("2012/1/1"), as.Date("2012/3/31"), "days")
    categories <- LETTERS[1:3]
    products <- data.frame(clientID = sample(clients, 100, replace = TRUE), 
                           OrderDate = sample(dates, 100, replace = TRUE), 
                           category = sample(categories, 100, replace = TRUE),
                           numProducts = sample(1:10, 100, replace = TRUE), 
                           OrderTotal = sample(1:20, 100, replace = TRUE))
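
    The question also asks for a long dataframe with one row per client, month, and category. A minimal sketch of one way to get that in base R uses aggregate() with a formula, which can sum both value columns in a single call. The `month` column, the "%Y-%m" label format, the object name `monthly`, and the seed are assumptions made for illustration.

    ```r
    set.seed(42)  # arbitrary seed, only so the sketch is reproducible
    clients <- 1:5
    dates <- seq(as.Date("2012/1/1"), as.Date("2012/3/31"), "days")
    categories <- LETTERS[1:3]
    products <- data.frame(clientID = sample(clients, 100, replace = TRUE),
                           OrderDate = sample(dates, 100, replace = TRUE),
                           category = sample(categories, 100, replace = TRUE),
                           numProducts = sample(1:10, 100, replace = TRUE),
                           OrderTotal = sample(1:20, 100, replace = TRUE))

    # Hypothetical helper column: year-month labels that sort chronologically
    products$month <- format(products$OrderDate, "%Y-%m")

    # Sum both value columns for every client / month / category combination
    monthly <- aggregate(cbind(numProducts, OrderTotal) ~ clientID + month + category,
                         data = products, FUN = sum)
    head(monthly)
    ```

    Unlike the xtabs result, this long form only contains combinations that actually occur in the data; client, month, and category combinations with no orders are left out rather than filled with zeros.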