Search code examples
rdataframeuniqueaverage

Mean all other other columns based on a single column in r


I have a large dataframe that has more than 40,000 columns and I am running into a problem similar to this Sum by distinct column value in R

shop <- data.frame( 
  'shop_id' = c('Shop A', 'Shop A', 'Shop A', 'Shop B', 'Shop C', 'Shop C'), 
  'Assets' = c(2, 15, 7, 5, 8, 3),
  'Liabilities' = c(5, 3, 8, 9, 12, 8),
  'sale' = c(12, 5, 9, 15, 10, 18), 
  'profit' = c(3, 1, 3, 6, 5, 9))

I have a column shop_id which is repeated many times. I have other values associated with that shop_id, such as assets, liabilities, profits, loss etc. I now want to average over all variables which have the same shop_id, i.e., I want unique shop_ids and want to average the all the columns that have same shop_id. Since, there are thousands of variables (columns) working with each column (variable) separately is very tedious.

My answer should be

 shop_id  Assets  Liabilities     sale    profit    
 Shop A   8.0     5.333333    8.666667  2.333333
 Shop B   5.0     9.000000   15.000000  6.000000
 Shop C   5.5    10.000000   14.000000  7.000000

I am currently using nested for loops as the following: As versatile as R is, I believe there should be a faster way to do this

idx <- split(1:nrow(shop), shop$shop_id)

newdata <- data.frame()

for( i in 1:length(idx)){
    newdata[i,1]<-c(names(idx)[i] )
    for (j in 2:ncol(shop)){
        newdata[i,j]<-mean(shop[unlist(idx[i]),j])
    }
}

Solution

  • Try data.table

    library(data.table)
    setDT(shop)[, lapply(.SD, mean), shop_id]
    #  shop_id Assets Liabilities      sale   profit
    #1:  Shop A    8.0    5.333333  8.666667 2.333333
    #2:  Shop B    5.0    9.000000 15.000000 6.000000
    #3:  Shop C    5.5   10.000000 14.000000 7.000000
    

    Or

    library(dplyr)
    shop %>% 
        group_by(shop_id)%>%
        summarise_each(funs(mean))
    # shop_id Assets Liabilities      sale   profit
    #1  Shop A    8.0    5.333333  8.666667 2.333333
    #2  Shop B    5.0    9.000000 15.000000 6.000000
    #3  Shop C    5.5   10.000000 14.000000 7.000000
    

    Or

    aggregate(.~shop_id, shop, FUN=mean)
    #   shop_id Assets Liabilities      sale   profit
    #1  Shop A    8.0    5.333333  8.666667 2.333333
    #2  Shop B    5.0    9.000000 15.000000 6.000000
    #3  Shop C    5.5   10.000000 14.000000 7.000000
    

    For 40,000 columns, I would use data.table or may be dplyr.