Search code examples
rgroupingstata

Summation of variables by Groups in R


I have a data frame, and I'd like to create a new column that gives the sum of a numeric variable grouped by factors. So something like this:

BEFORE:

data1 <- data.frame(month = c(1, 1, 2, 2, 3, 3), 
                    sex = c("m", "f", "m", "f", "m", "f"), 
                    value = c(10, 20, 30, 40, 50, 60))

AFTER:

data2 <- data.frame(month = c(1, 1, 2, 2, 3, 3), 
                    sex = c("m", "f", "m", "f", "m", "f"), 
                    value = c(10, 20, 30, 40, 50, 60), 
                    sum = c(30, 30, 70, 70, 110, 110))

In Stata you can do this with the egen command quite easily. I've tried the aggregate function, and the ddply function but they create entirely new data frames, and I just want to add a column to the existing one.


Solution

  • You are looking for ave

    > data2 <- transform(data1, sum=ave(value, month, FUN=sum))
      month sex value sum
    1     1   m    10  30
    2     1   f    20  30
    3     2   m    30  70
    4     2   f    40  70
    5     3   m    50 110
    6     3   f    60 110
    

    data1$sum <- ave(data1$value, data1$month, FUN=sum) is useful if you don't want to use transform

    Also data.table is helpful

    library(data.table)
    DT <- data.table(data1)
    DT[, sum:=sum(value), by=month]
    

    UPDATE

    We can also use a tidyverse approach which is simple, yet elegant:

    > library(tidyverse)
    > data1 %>% 
        group_by(month) %>% 
        mutate(sum=sum(value))
    # A tibble: 6 x 4
    # Groups:   month [3]
      month sex   value   sum
      <dbl> <fct> <dbl> <dbl>
    1     1 m        10    30
    2     1 f        20    30
    3     2 m        30    70
    4     2 f        40    70
    5     3 m        50   110
    6     3 f        60   110