Search code examples
rdata-analysis

Hierarchical data in R- How do I sum over subsets, while maintaing the tree?


Probably, I haven't defined the problem very well. I don't seem to understand what R is returning out of sapply. I have a large dataframe of hierarchical data. About half the columns are factors and half are numerical. I want to get a new dataframe that contains some of the factors, and sums over the numerical columns, but I want the sums to remain separated by factor levels.

For instance, from the sample data below, I'd like to make a dataframe with the state, district, branch the same, but sum the data for orders of the same type but with different colours. I'm thinking that iterative use of sapply will do it, but I can't seem to get it to work.

sample data:

state district branch   order   colour  number  cost    amount
CA   central newtown    shoes   black   6   25.50  127.40
CA   central newtown    shoes   brown   3   32.12   75.40
CA   central newtown    gloves  blue    15  12.20  157.42
CA   central newtown    gloves  black   9   8.70    65.37
CA  central columbus    shoes   black   12  30.75   316.99
CA  central columbus    shoes   brown   1   40.98    45.00
CA  central columbus    gloves  blue    47  11.78   498.32
CA  central columbus    gloves  black   23  7.60    135.50

Solution

  • Another job for aggregate. Calling your data frame dat:

    aggregate(cbind(cost, amount) ~ state+district+branch+order, data=dat, FUN=sum)
    
    ##   state district   branch  order  cost amount
    ## 1    CA  central columbus gloves 19.38 633.82
    ## 2    CA  central  newtown gloves 20.90 222.79
    ## 3    CA  central columbus  shoes 71.73 361.99
    ## 4    CA  central  newtown  shoes 57.62 202.80
    

    On the left side of the ~, cbind is used to indicate that we want each column separately. If cost + amount were specified, it would mean the sum here because these are numeric. On the right side of the ~, we have factors, so the + means that we are aggregating by each level of each factor.