I regularly have to perform a piped series of operations that groups by one or more (usually two) variables, finds the mean and confidence interval of one or more variables, and outputs the results to a summary table for plotting or reporting.
Usually I do this by copying and pasting a script, e.g.:
aggdata <- data %>%
  group_by(Time, Category) %>%
  summarise(mean.Volume = mean(Volume, na.rm = TRUE),
            sd.Volume = sd(Volume, na.rm = TRUE),
            n.Volume = n(),
            Volume = sum(Volume)) %>%
  mutate(se.Volume = sd.Volume / sqrt(n.Volume),
         lower.ci.Volume = mean.Volume - qt(1 - (0.05 / 2), n.Volume - 1) * se.Volume,
         upper.ci.Volume = mean.Volume + qt(1 - (0.05 / 2), n.Volume - 1) * se.Volume)
So I tried writing a function for this; however, each of the following:
aggvols1 <- function(data, a, b, values) {
  data %>%
    group_by(a, b) %>%
    summarise(mean.Volume = mean(values, na.rm = TRUE),
              sd.Volume = sd(values, na.rm = TRUE),
              n.Volume = n(),
              Volume = sum(values)) %>%
    mutate(se.Volume = sd.Volume / sqrt(n.Volume),
           lower.ci.Volume = mean.Volume - qt(1 - (0.05 / 2), n.Volume - 1) * se.Volume,
           upper.ci.Volume = mean.Volume + qt(1 - (0.05 / 2), n.Volume - 1) * se.Volume)
}

aggvols2 <- function(data, a, b, values) {
  groupvars <- c(data$a, data$b) # also does not work if just use c(a, b)
  data %>%
    group_by(groupvars) %>%
    summarise(mean.Volume = mean(values, na.rm = TRUE),
              sd.Volume = sd(values, na.rm = TRUE),
              n.Volume = n(),
              Volume = sum(values)) %>%
    mutate(se.Volume = sd.Volume / sqrt(n.Volume),
           lower.ci.Volume = mean.Volume - qt(1 - (0.05 / 2), n.Volume - 1) * se.Volume,
           upper.ci.Volume = mean.Volume + qt(1 - (0.05 / 2), n.Volume - 1) * se.Volume)
}
followed by e.g.
test <- aggvols1(data=salesdata, a=Participation, b=Time_Period, values=volumes_sold)
returns the same error message:
Error in aggvols1(data=salesdata, a=Participation, b=Time_Period, values=volumes_sold) :
unused arguments (a = Participation, b = Time_Period)
How can I make the arguments a and b get passed as the grouping variables so that the function returns a table of grouped means and CIs?
Ultimately my goal is not just to get this running but to alter it so that, instead of specifying two grouping variable columns and a single value column, I can specify a vector of grouping variables and a vector of value variables. The function should then group by, and calculate responses for, one or multiple columns, adding the column name of each input "values" variable as a suffix to each output column for differentiation.
Any advice on how to fix the function so it runs and/or how to improve the function as described above would be greatly appreciated; I'm new to writing my own functions but am trying to move towards using them instead of just copying and pasting code where possible.
I would also advise you to use rlang syntax, though I take a slightly different approach.
You have to quote the arguments (capture them with enquo() and unquote them with !!) to get dplyr to accept variable names the way you want to provide them inside a function.
Also have a look at vignette("programming", "dplyr") and the RStudio cheat sheet for rlang here: https://rstudio.com/resources/cheatsheets/.
The following code works for me.
library(dplyr)

aggvols1 <- function(data, a, b, values) {
  # Capture the unquoted column names as quosures
  a <- enquo(a)
  b <- enquo(b)
  values <- enquo(values)
  data %>%
    group_by(!!a, !!b) %>% # unquote with !! so dplyr sees the column names
    summarise(mean.Volume = mean(!!values, na.rm = TRUE),
              sd.Volume = sd(!!values, na.rm = TRUE),
              n.Volume = n(),
              Volume = sum(!!values)) %>%
    mutate(se.Volume = sd.Volume / sqrt(n.Volume),
           lower.ci.Volume = mean.Volume - qt(1 - (0.05 / 2), n.Volume - 1) * se.Volume,
           upper.ci.Volume = mean.Volume + qt(1 - (0.05 / 2), n.Volume - 1) * se.Volume)
}
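For the second part of the question (several grouping columns and several value columns, with the value column name as a suffix), one possible direction is across() together with the {{ }} embrace operator. This is only a rough sketch, assuming dplyr 1.0.0 or later. The function name agg_ci, its conf argument and the toy data are made up for illustration and do not come from the question. Note that n here counts non-missing values, which differs from n() in the code above.

```r
library(dplyr)

# Hypothetical helper (not from the original post): summarise any number of
# value columns by any number of grouping columns.
agg_ci <- function(data, groups, values, conf = 0.95) {
  # Number of non-missing observations (assumption: differs from n(),
  # which counts every row in the group)
  n_obs <- function(x) sum(!is.na(x))
  # Standard error of the mean
  se <- function(x) sd(x, na.rm = TRUE) / sqrt(n_obs(x))
  # Half-width of the t-based confidence interval
  half <- function(x) qt(1 - (1 - conf) / 2, n_obs(x) - 1) * se(x)

  data %>%
    group_by(across({{ groups }})) %>%
    summarise(
      across({{ values }},
             list(mean     = function(x) mean(x, na.rm = TRUE),
                  sd       = function(x) sd(x, na.rm = TRUE),
                  n        = n_obs,
                  se       = se,
                  lower.ci = function(x) mean(x, na.rm = TRUE) - half(x),
                  upper.ci = function(x) mean(x, na.rm = TRUE) + half(x)),
             # Output names such as mean.Volume, upper.ci.Price
             .names = "{.fn}.{.col}"),
      .groups = "drop"
    )
}

# Toy data, invented purely to show the call
toy <- data.frame(
  Time     = c("t1", "t1", "t2", "t2"),
  Category = c("a", "a", "a", "a"),
  Volume   = c(1, 3, 10, 20),
  Price    = c(2, 4, 6, 8)
)

res <- agg_ci(toy, groups = c(Time, Category), values = c(Volume, Price))
```

Both groups and values accept a single bare column name or a c(...) of them, so the two-group, one-value case from the question would be agg_ci(salesdata, c(Participation, Time_Period), volumes_sold). Groups with a single observation give NA for the standard deviation and the interval bounds.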