Probably, I haven't defined the problem very well. I don't seem to understand what R is returning out of sapply
. I have a large dataframe of hierarchical data. About half the columns are factors and half are numerical. I want to get a new dataframe that contains some of the factors, and sums over the numerical columns, but I want the sums to remain separated by factor levels.
For instance, from the sample data below, I'd like to make a dataframe with the state, district, branch the same, but sum the data for orders of the same type but with different colours. I'm thinking that iterative use of sapply
will do it, but I can't seem to get it to work.
sample data:
state district branch order colour number cost amount
CA central newtown shoes black 6 25.50 127.40
CA central newtown shoes brown 3 32.12 75.40
CA central newtown gloves blue 15 12.20 157.42
CA central newtown gloves black 9 8.70 65.37
CA central columbus shoes black 12 30.75 316.99
CA central columbus shoes brown 1 40.98 45.00
CA central columbus gloves blue 47 11.78 498.32
CA central columbus gloves black 23 7.60 135.50
Another job for aggregate
. Calling your data frame dat
:
aggregate(cbind(cost, amount) ~ state+district+branch+order, data=dat, FUN=sum)
## state district branch order cost amount
## 1 CA central columbus gloves 19.38 633.82
## 2 CA central newtown gloves 20.90 222.79
## 3 CA central columbus shoes 71.73 361.99
## 4 CA central newtown shoes 57.62 202.80
On the left side of the ~, cbind
is used to indicate that we want each column separately. If cost + amount
were specified, it would mean the sum here because these are numeric. On the right side of the ~, we have factors, so the + means that we are aggregating by each level of each factor.