r dplyr tidyverse data-cleaning data-wrangling

How to use summarise() and keep other columns in R?


This isn't a duplicate of this question, since I'm asking to retain a column based on a condition. I'd like to group by the values in column x and sum the corresponding values in column z. I have the following data:

df <- data.frame(x = c("a", "a"),
                 y = c("a", "b"),
                 z = c(4, 5))

x y z
a a 4
a b 5

I can use summarise() from dplyr to do this, like so:

library(dplyr)

df %>%
  group_by(x) %>%
  summarise(z = sum(z))

x z
a 9

But I also want to keep column y, taking its value from the row where x == y, so the final output should look like this:

x y z
a a 9

How do I do this?


Solution

    The short answer

    What you want is this:

    library(dplyr)  # the .by argument requires dplyr 1.1.0 or later

    df %>%
      mutate(z = sum(z), .by = x) %>%
      filter(x == y)

      x y z
    1 a a 9
    

    The long answer

    The dplyr documentation describes summarise() like this:

    summarise() creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.

    From what I understand, you're asking to group by x, use summarise to find the sum of z, but retain the value of y when x == y.

    The issue, as the description shows, is that summarise() returns only the grouping columns and the summary columns. You want the sum of z grouped by x alone, yet you also want to keep the value of y from the row where x == y. You can group by x and y and sum z within each pair, or you can group by just x and sum z across the whole group, but summarise() won't do both at once.
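
To make the trade-off concrete, here is a minimal sketch of the two summarise() options, run on the data frame from the question. The names by_x and by_xy are only illustrative.

```r
library(dplyr)

df <- data.frame(x = c("a", "a"),
                 y = c("a", "b"),
                 z = c(4, 5))

# Grouping by x only: one row per value of x, and y is dropped.
by_x <- df %>%
  group_by(x) %>%
  summarise(z = sum(z))

# Grouping by x and y: y is kept, but z is summed within each (x, y) pair,
# so the two rows are never combined into a single total.
by_xy <- df %>%
  group_by(x, y) %>%
  summarise(z = sum(z), .groups = "drop")

print(by_x)
print(by_xy)
```

The first result has the total you want but no y column; the second has y but still two rows with the original z values.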

    What you want instead is mutate(). Unlike summarise(), it keeps every row and column, so you can replace every value of z with the sum of z within its x group and then filter down to the row where x == y.
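
The two steps can also be written out separately, which may help if your dplyr version predates the .by argument. This is a sketch that uses group_by() and ungroup() in its place; with_totals and result are illustrative names, not anything from the question.

```r
library(dplyr)

df <- data.frame(x = c("a", "a"),
                 y = c("a", "b"),
                 z = c(4, 5))

# Step 1: overwrite z with the group total, keeping every row and column.
with_totals <- df %>%
  group_by(x) %>%
  mutate(z = sum(z)) %>%
  ungroup()

# Step 2: keep only the rows where x and y match.
result <- with_totals %>%
  filter(x == y)

print(with_totals)
print(result)
```

One thing to watch for: if a group has no row where x == y, filter() drops that group from the result entirely.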