
Using tapply with multiple grouping variables


I have a data set containing customers and how much each of them has spent. Each customer appears only once:

customer<-c("Andy","Bobby","Oscar","Oliver","Jane","Cathy","Emma","Chris")
age<-c(25,34,20,35,23,35,34,22)
gender<-c("male","male","male","male","female","female","female","female")
moneyspent<-c(100,100,200,200,400,400,500,200)

data<-data.frame(customer=customer,age=age,gender=gender,moneyspent=moneyspent)

If I want to calculate the average amount of money spent by male and female customers, I can use tapply:

tapply(moneyspent,gender,mean)

which gives:

female   male 
  375    150

However, I now want the average amount of money spent by both gender and age group. The result I am aiming for is:

 Male Age 20-30      Female Age 20-30      Male Age 30-40      Female Age 30-40
    150                     300                 150                   450

How could I modify the tapply code so that it gives these results?

THANK YOU


Solution

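  • You can bin age with cut and pass both grouping factors to tapply as a list. include.lowest=TRUE is needed here so that age 20, the lowest break, falls into the first interval instead of becoming NA: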

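    # Bin age into [20,30] and (30,40], then average by gender x age bin
    mat <- tapply(moneyspent, list(gender, age=cut(age, breaks=c(20,30,40), 
                    include.lowest=TRUE)), mean)
    
    # Build "gender bin" names and flatten the matrix into a named vector
    nm1 <- outer(rownames(mat), colnames(mat), FUN=paste)
    setNames(c(mat), nm1)
    #female [20,30]   male [20,30] female (30,40]   male (30,40] 
    #       300            150            450            150 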
    

    Other options include

    library(dplyr)
    data %>% 
         group_by(gender, age=cut(age, breaks=c(20,30,40), 
                  include.lowest=TRUE)) %>% 
         summarise(moneyspent=mean(moneyspent))
    

    Or

    library(data.table)
    setDT(data)[, list(moneyspent=mean(moneyspent)),
        by=list(gender, age=cut(age, breaks=c(20,30,40), include.lowest=TRUE))]
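Going back to the tapply version: if the goal is labels closer to the ones in the question ("Male Age 20-30" and so on), one possibility is to give cut a labels argument and combine the two factors with interaction before calling tapply. This is only a sketch in base R; the names sex, agegroup, grp and res and the label strings are illustrative choices, not something the data requires.

```r
customer   <- c("Andy","Bobby","Oscar","Oliver","Jane","Cathy","Emma","Chris")
age        <- c(25, 34, 20, 35, 23, 35, 34, 22)
gender     <- c("male","male","male","male","female","female","female","female")
moneyspent <- c(100, 100, 200, 200, 400, 400, 500, 200)

# Bin ages into [20,30] and (30,40]; the label strings are illustrative
agegroup <- cut(age, breaks = c(20, 30, 40), include.lowest = TRUE,
                labels = c("Age 20-30", "Age 30-40"))

# Relabel gender and put "Male" first so the groups follow the question's order
sex <- factor(gender, levels = c("male", "female"),
              labels = c("Male", "Female"))

# interaction() merges both factors into one; the first factor varies fastest
grp <- interaction(sex, agegroup, sep = " ")

# One grouping factor now, so tapply returns a named one-dimensional result
res <- tapply(moneyspent, grp, mean)
res
# Groups, in order: Male Age 20-30, Female Age 20-30,
#                   Male Age 30-40, Female Age 30-40
# Means, in order:  150, 300, 150, 450
```

Because the grouping is collapsed into a single factor, no outer/setNames step is needed afterwards. The trade-off is that the result is no longer a gender-by-age matrix.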