Search code examples
rgroup-bysum

Combine multiple rows to one row per group


I feel like my problem is an easy one to solve, however, I can't seem to figure it out.

I want to combine multiple rows that belong to the same group so that per group there is one row. This row has the sum over the rows for some variables and the mean for other variables. In the example, I only included variable treatment of which I need the sum across the rows of each group episode.

Df <- data.frame(country = c("A", "A", "A", "A", "A", "B","B", "B", "B"),
                 year = c("1950", "1951", "1952", "1953", "1954", "1950", "1951", "1952", "1953"), 
                 time1 = c("1950", "1951", "1951", "1953", "1954", "1950", "1951", "1952", "1952"), 
                 time2 = c("1951", "1953", "1953", "1954", "1955", "1951", "1952", "1954", "1954"),
                 episode = c("1", "2", "2", "3", "4", "1", "2", "3", "3"),
                 status = c(0, 1, 1, 0, 1, 1, 0, 1, 1),
                 treatment = c(10, "NA", 20, 5, "NA", "NA", 30, 100, 10))

Df2 <- data.frame(country = c("A", "A", "A", "A", "B", "B", "B"),
                   time1 = c("1950", "1951", "1953", "1954", "1950", "1951", "1952"), 
                   time2 = c("1951", "1953", "1954", "1955", "1951", "1952", "1954"),
                   episode = c("1", "2", "3", "4", "1", "2", "3"),
                   status = c(0, 1, 0, 1, 1, 0, 1),
                   treatment = c(10, 20, 5, 0, 0, 30, 110))

Any ideas on how to solve this?


Solution

  • Update: After clarification see comments:

    1. Column treatment is character. With type.convert(as.is=TRUE) we transform it to integer.
    2. Then we group and after that
    3. we use summarise with sum
    library(dplyr)
    Df %>% 
      type.convert(as.is=TRUE) %>% 
      group_by(country, time1, time2, episode, status) %>%
      summarise(treatment = sum(treatment, na.rm = TRUE))
    
      country time1 time2 episode status treatment
      <chr>   <int> <int>   <int>  <int>     <int>
    1 A        1950  1951       1      0        10
    2 A        1951  1953       2      1        20
    3 A        1953  1954       3      0         5
    4 A        1954  1955       4      1         0
    5 B        1950  1951       1      1         0
    6 B        1951  1952       2      0        30
    7 B        1952  1954       3      1       110
    

    First answer:

    library(dplyr)
    Df %>% 
      type.convert(as.is=TRUE) %>% 
      group_by(episode) %>% 
      summarise(sumTreatment=sum(treatment, na.rm = TRUE))
    
      episode sumTreatment
        <int>        <int>
    1       1           10
    2       2           50
    3       3          115
    4       4            0