
Performing operation among levels of grouped variable in R/dplyr


I want to perform a calculation between the levels of a grouping variable and fit this into a dplyr/tidyverse-style workflow. I know the wording is confusing, but I hope the example below clarifies it.

Below, I want to find the difference between levels "A" and "B" for each year for which I have data. One solution is to cast the data from long to wide format and use mutate() to compute the difference between A and B in a new column.

Ultimately, I'm working with a much larger dataset in which I want to find the response ratio of some measured variable for each of N species and for every year of sampling. Keeping the calculation in a long-format workflow would greatly help with later uses of the data.



library(tidyverse)
library(reshape)


set.seed(34)

test = data.frame(Year = rep(seq(2011,2020),2),
                  Letter = rep(c('A','B'),each = 10),
                  Response = sample(100,20))





test.results = test %>% 
  cast(Year ~ Letter, value = 'Response') %>% 
  mutate(diff = A - B)

#test.results
   Year  A   B diff
   2011 93  48   45
   2012 33  44  -11
   2013  9  80  -71
   2014 10  61  -51
   2015 50  67  -17
   2016  8  43  -35
   2017 86  20   66
   2018 54  99  -45
   2019 29 100  -71
   2020 11  46  -35

Is there a solution where I could group by Year and then use a function like summarise() to calculate between the levels of the variable "Letter"?

test %>%
  group_by(Year) %>%
  summarise(...)  # something here to perform a calculation between levels "A" and "B" of the variable "Letter"



Solution

  • You can subset the Response values for "A" and "B" within each year and then take the difference. This assumes each year has exactly one "A" row and one "B" row.

    library(dplyr)
    
    test %>%
      group_by(Year) %>%
      summarise(diff = Response[Letter == 'A'] - Response[Letter == 'B'])
    
    #    Year  diff
    #   <int> <int>
    # 1  2011    45
    # 2  2012   -11
    # 3  2013   -71
    # 4  2014   -51
    # 5  2015   -17
    # 6  2016   -35
    # 7  2017    66
    # 8  2018   -45
    # 9  2019   -71
    #10  2020   -35
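
    If some year might lack one of the two levels, the logical subsetting above returns an empty vector for that group. One possible guard, shown here as a minimal sketch on hypothetical toy data (the `toy` data frame and its values are made up for illustration), is to index with base R's `match()`, which yields NA when a level is absent:

    ```r
    library(dplyr)

    # Hypothetical toy data: 2012 has an "A" row but no "B" row
    toy <- data.frame(
      Year     = c(2011, 2011, 2012),
      Letter   = c("A", "B", "A"),
      Response = c(93, 48, 33)
    )

    toy_diff <- toy %>%
      group_by(Year) %>%
      summarise(
        # match() returns the position of the first "A"/"B" in the group,
        # or NA if that level is missing, so the difference becomes NA
        diff = Response[match("A", Letter)] - Response[match("B", Letter)]
      )

    toy_diff
    ```

    Here 2011 should keep its difference of 45, while 2012 stays in the result with a missing value instead of being dropped. If a level can occur more than once per year, only its first occurrence is used.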
    

    In this example, we can also take advantage of ordering: after sorting with desc(Letter), "B" comes before "A" within each year, so diff() returns A - B:

    test %>%
      arrange(Year, desc(Letter)) %>%
      group_by(Year) %>%
      summarise(diff = diff(Response))
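
The same pattern should carry over to the larger dataset described in the question by adding the species column to the grouping. The following is a sketch only: the `Species` column, the `sp` data frame, its values, and the definition of the response ratio as A / B are all assumptions made for illustration, and the `.groups` argument requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical long-format data: two species, two years, levels "A" and "B"
sp <- data.frame(
  Species  = rep(c("sp1", "sp2"), each = 4),
  Year     = rep(rep(c(2011, 2012), each = 2), times = 2),
  Letter   = rep(c("A", "B"), times = 4),
  Response = c(10, 20, 30, 15, 8, 4, 50, 100)
)

ratios <- sp %>%
  group_by(Species, Year) %>%
  # Assumed definition of the response ratio: level "A" divided by level "B"
  summarise(ratio = Response[Letter == "A"] / Response[Letter == "B"],
            .groups = "drop")

ratios
```

The result has one row per Species and Year combination, so the data never leave long format. As with the difference, this relies on exactly one "A" and one "B" row in each group.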