I'm trying to build a function that receives a data frame (data), one or two variables to group by (groupby), and the name of a dependent variable (var). The function should then create a plot of the means of var, separated by the group(s) in groupby. As a nice-to-have, it would also run an ANOVA at the end.
I'll start with the end: my problem is how to use (string) values in further manipulations inside a user-defined function.
Unfortunately I have problems parsing groupby, which I couldn't solve after a couple of days of trying. I tried:
!!!rlang::parse_exprs, strsplit, etc.
but with no success. Currently it looks something like this (a simplified version with fewer aesthetics):
grp_comp <- function(data, groupby, var){
  data %>%
    filter(!is.na(var)) %>%
    group_by(!!!rlang::parse_exprs(groupby)) %>%
    summarize(n = n(),
              mean = mean(!!!rlang::parse_expr(var)),
              sd = sd(!!!rlang::parse_expr(var)),
              se = sd / sqrt(n)) -> ddata
  gg <- unlist(rlang::parse_exprs(groupby))
  if(length(as.vector(rlang::parse_exprs(groupby))) == 1){
    g <- ggplot(ddata, aes(x = as.character(gg[1]),
                           y = mean)) +
      geom_point()}
  else{
    g <- ggplot(ddata, aes(x = as.character(gg[1]),
                           y = mean,
                           shape = as.character(gg[2]),
                           color= as.character(gg[2])),
                group = as.character(gg[2]))}
  form <- unlist(strsplit(groupby, ';', fixed = T))
  form <- paste(form, collapse = " + ")
  form <- paste(var, " ~ ", form)
  form
  data%>%
    filter(!is.na(var)) %>%
    aov(formula = form) -> anova
  summary(anova) -> anova
  l <- list(ddata, g, anova)
  l
}
My problems are:
a. groupby can contain one or two variables, and I can't manage to use groupby as the grouping argument in the ggplots. If I use x = gg[1], I get: Error: Discrete value supplied to continuous scale. If I use x = as.factor(gg[1]) or as.character, I get the following plot (i.e. x is only labeled "BPL", but not grouped by the factor).
b. When I try to use two (instead of one) groupby factors, things get even worse and the plot is completely empty.
c. I manage to create the right formula for the ANOVA, but when I try to actually calculate it I get: Error: $ operator is invalid for atomic vectors. Any ideas why?
d. Not critical, but any ideas for using the second, optional group as color and shape in aes() when the argument contains two groups, without using the if?
Many many thanks in advance!
Guy
It's not clear how you want to call this function, but you could do something like:
library(tidyverse)
grp_comp <- function(data, groupby, var){
  ddata <- data %>%
    filter(!is.na({{var}})) %>%
    group_by(!!!rlang::parse_exprs(groupby)) %>%
    summarize(n = n(),
              mean = mean({{var}}),
              sd = sd({{var}}),
              se = sd / sqrt(n))
  gg <- unlist(rlang::parse_exprs(groupby))
  g <- if(length(as.vector(rlang::parse_exprs(groupby))) == 1)
    ggplot(ddata, aes(x = !!gg[[1]], y = mean)) + geom_point()
  else {
    ggplot(ddata, aes(x = !!gg[[1]], y = mean, shape = factor(!!gg[[2]]),
                      color = !!gg[[2]], group = !!gg[[2]])) + geom_point()
  }
  form <- unlist(strsplit(groupby, ';', fixed = TRUE))
  form <- paste(form, collapse = " + ")
  form <- paste(deparse(substitute(var)), " ~ ", form)
  data %>%
    filter(!is.na({{var}})) %>%
    aov(formula = as.formula(form)) -> anova
  summary(anova) -> anova
  list(ddata, g, anova)
}
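On the ANOVA error (point c): the pasted string is still a character vector, and aov() only works once it has been turned into a real formula, which is what as.formula() does in the function. Here is a minimal base R sketch of the difference. The variables groupby and var are hypothetical stand-ins for the function arguments, and reformulate() is shown as an alternative to paste() plus as.formula(), not as what the function uses.

```r
# Hypothetical inputs, mirroring the mtcars call style used in this answer.
groupby <- c("gear", "cyl")
var <- "mpg"

# paste() yields a plain character string, not a formula object.
form_chr <- paste(var, "~", paste(groupby, collapse = " + "))

# Handing the string straight to aov() fails; capture the condition to inspect it.
res <- tryCatch(aov(form_chr, data = mtcars), error = function(e) e)

# reformulate() builds the formula object directly from the column names.
form <- reformulate(groupby, response = var)

# With a real formula the model fits as expected.
fit <- aov(form, data = mtcars)
summary(fit)
```

Either as.formula(form_chr) or reformulate() should do here; reformulate() just skips the string-pasting step.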
The grp_comp function above can then be called like this:
grp_comp(iris, "Species", Sepal.Length)
#> [[1]]
#> # A tibble: 3 x 5
#>   Species        n  mean    sd     se
#>   <fct>      <int> <dbl> <dbl>  <dbl>
#> 1 setosa        50  5.01 0.352 0.0498
#> 2 versicolor    50  5.94 0.516 0.0730
#> 3 virginica     50  6.59 0.636 0.0899
#>
#> [[2]]
#>
#> [[3]]
#>              Df Sum Sq Mean Sq F value Pr(>F)
#> Species       2  63.21  31.606   119.3 <2e-16 ***
#> Residuals   147  38.96   0.265
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And
grp_comp(mtcars, c("gear", "cyl"), mpg)
#> `summarise()` has grouped output by 'gear'. You can override using the
#> `.groups` argument.
#> [[1]]
#> # A tibble: 8 x 6
#> # Groups: gear [3]
#>    gear   cyl     n  mean     sd     se
#>   <dbl> <dbl> <int> <dbl>  <dbl>  <dbl>
#> 1     3     4     1  21.5 NA     NA
#> 2     3     6     2  19.8  2.33   1.65
#> 3     3     8    12  15.0  2.77   0.801
#> 4     4     4     8  26.9  4.81   1.70
#> 5     4     6     4  19.8  1.55   0.776
#> 6     5     4     2  28.2  3.11   2.2
#> 7     5     6     1  19.7 NA     NA
#> 8     5     8     2  15.4  0.566  0.400
#>
#> [[2]]
#>
#> [[3]]
#>             Df Sum Sq Mean Sq F value   Pr(>F)
#> gear         1  259.7   259.7   24.87 2.63e-05 ***
#> cyl          1  563.4   563.4   53.94 4.32e-08 ***
#> Residuals   29  302.9    10.4
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Created on 2022-08-27 with reprex v2.0.2
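On point d (avoiding the if): one possible approach, sketched below, is to pass both groupby and var as strings and always map color and shape to the last grouping variable. With a single group that is the same variable as x, so no branch is needed, at the cost of a redundant legend. The name grp_plot is hypothetical, and the sketch assumes dplyr (>= 1.0) and a ggplot2 version that supports the .data pronoun inside aes(). It returns only the plot.

```r
library(dplyr)
library(ggplot2)

# Hypothetical variant: `groupby` and `var` are both character vectors.
grp_plot <- function(data, groupby, var) {
  ddata <- data %>%
    filter(!is.na(.data[[var]])) %>%            # look the column up by name
    group_by(across(all_of(groupby))) %>%       # one or two grouping columns
    summarise(n = n(),
              mean = mean(.data[[var]]),
              sd = sd(.data[[var]]),
              se = sd / sqrt(n),
              .groups = "drop")

  # The last grouping variable: equal to the first one when only one is given.
  last <- groupby[length(groupby)]

  ggplot(ddata, aes(x = factor(.data[[groupby[1]]]),
                    y = mean,
                    colour = factor(.data[[last]]),
                    shape = factor(.data[[last]]))) +
    geom_point() +
    labs(x = groupby[1], colour = last, shape = last)
}

p1 <- grp_plot(iris, "Species", "Sepal.Length")
p2 <- grp_plot(mtcars, c("gear", "cyl"), "mpg")
```

Wrapping the grouping columns in factor() also sidesteps the "Discrete value supplied to continuous scale" error for numeric groups such as gear and cyl.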