I'm trying to build a function that receives a data frame (data), one or more variables to group by (groupby), and the name of a dependent variable (var). The function should then create a plot of the means of var, separated by the group(s) in groupby. A nice-to-have would be an ANOVA at the end.
I'll start with the end: my problem is how to use (string) values in further manipulations inside a user-defined function.
Unfortunately I have problems parsing groupby, which I couldn't solve after a couple of days of trying. I tried using:
!!!rlang::parse_exprs, strsplit, etc...
but with no success. Currently it looks something like this (this is the simplified version with fewer aesthetics):
grp_comp <- function(data, groupby, var){
data %>%
filter(!is.na(var)) %>%
group_by(!!!rlang::parse_exprs(groupby)) %>%
summarize(n = n(),
mean = mean(!!!rlang::parse_expr(var)),
sd = sd(!!!rlang::parse_expr(var)),
se = sd / sqrt(n)) -> ddata
gg <- unlist(rlang::parse_exprs(groupby))
if(length(as.vector(rlang::parse_exprs(groupby))) == 1){
g <- ggplot(ddata, aes(x = as.character(gg[1]),
y = mean)) +
g <- ggplot(ddata, aes(x = as.character(gg[1]),
y = mean,
shape = as.character(gg[2]),
color= as.character(gg[2])),
group = as.character(gg[2]))}
form <- unlist(strsplit(groupby, ';', fixed = T))
form <- paste(form, collapse = " + ")
form <- paste(var, " ~ ", form)
filter(!is.na(var)) %>%
aov(formula = form) -> anova
summary(anova) -> anova
l <- list(ddata, g, anova)
My problems are:
a. groupby could contain one or two variables. I can't manage to use groupby as an argument for group_by in the ggplots. Either I use x = gg[1] and get Error: Discrete value supplied to continuous scale, or I use x = as.factor(gg[1]) or x = as.character(gg[1]) and get the following plot (i.e. x is only labeled "BPL", but not grouped by the factor).
b. When I try to use two (instead of one) groupby factors, things get even worse and the plot is completely empty.
c. I manage to create the right formula for the ANOVA, but when I try to actually calculate it I get Error: $ operator is invalid for atomic vectors. Any ideas why?
d. Not critical, but any ideas for using the second, optional group as color and shape in aes() when the argument contains two groups, without using the if?
Many many thanks in advance!
It's not clear how you want to call this function, but you could do something like:
library(dplyr)
library(ggplot2)

grp_comp <- function(data, groupby, var){
  # Summary statistics of var within each group
  ddata <- data %>%
    filter(!is.na({{var}})) %>%
    group_by(!!!rlang::parse_exprs(groupby)) %>%
    summarize(n = n(),
              mean = mean({{var}}),
              sd = sd({{var}}),
              se = sd / sqrt(n))
  # Group names as expressions, so they can be unquoted inside aes()
  gg <- unlist(rlang::parse_exprs(groupby))
  g <- if(length(as.vector(rlang::parse_exprs(groupby))) == 1)
    ggplot(ddata, aes(x = !!gg[[1]], y = mean)) + geom_point()
  else {
    ggplot(ddata, aes(x = !!gg[[1]], y = mean, shape = factor(!!gg[[2]]),
                      color = !!gg[[2]], group = !!gg[[2]])) + geom_point()
  }
  # Build the model formula as text, then convert it with as.formula()
  form <- unlist(strsplit(groupby, ';', fixed = TRUE))
  form <- paste(form, collapse = " + ")
  form <- paste(deparse(substitute(var)), " ~ ", form)
  data %>%
    filter(!is.na({{var}})) %>%
    aov(formula = as.formula(form)) -> anova
  summary(anova) -> anova
  list(ddata, g, anova)
}
This allows:
grp_comp(iris, "Species", Sepal.Length)
#> [[1]]
#> # A tibble: 3 x 5
#>   Species        n  mean    sd     se
#>   <fct>      <int> <dbl> <dbl>  <dbl>
#> 1 setosa        50  5.01 0.352 0.0498
#> 2 versicolor    50  5.94 0.516 0.0730
#> 3 virginica     50  6.59 0.636 0.0899
#>
#> [[2]]
#>
#> [[3]]
#>              Df Sum Sq Mean Sq F value Pr(>F)
#> Species       2  63.21  31.606   119.3 <2e-16 ***
#> Residuals   147  38.96   0.265
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
grp_comp(mtcars, c("gear", "cyl"), mpg)
#> `summarise()` has grouped output by 'gear'. You can override using the
#> `.groups` argument.
#> [[1]]
#> # A tibble: 8 x 6
#> # Groups:   gear [3]
#>    gear   cyl     n  mean     sd     se
#>   <dbl> <dbl> <int> <dbl>  <dbl>  <dbl>
#> 1     3     4     1  21.5 NA     NA
#> 2     3     6     2  19.8  2.33   1.65
#> 3     3     8    12  15.0  2.77   0.801
#> 4     4     4     8  26.9  4.81   1.70
#> 5     4     6     4  19.8  1.55   0.776
#> 6     5     4     2  28.2  3.11   2.2
#> 7     5     6     1  19.7 NA     NA
#> 8     5     8     2  15.4  0.566  0.400
#>
#> [[2]]
#>
#> [[3]]
#>             Df Sum Sq Mean Sq F value   Pr(>F)
#> gear         1  259.7   259.7   24.87 2.63e-05 ***
#> cyl          1  563.4   563.4   53.94 4.32e-08 ***
#> Residuals   29  302.9    10.4
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
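On point c of the question: the `$ operator is invalid for atomic vectors` error most likely comes from handing aov() the formula as a plain character string, which is why the function wraps it in as.formula(). As an alternative sketch (not what the function above does), base R's reformulate() builds the formula object directly from the variable names. The groupby and var values below are assumed inputs, chosen only to mirror the mtcars call.

```r
# Assumed inputs, mirroring the mtcars call
groupby <- c("gear", "cyl")
var <- "mpg"

# reformulate() (stats package, attached by default) returns a formula
# object, so no paste() or as.formula() step is needed
form <- reformulate(groupby, response = var)
print(form)

# aov() accepts the formula object directly
fit <- aov(form, data = mtcars)
summary(fit)
```

This fits the same additive model as the pasted string, with one term per grouping variable.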
Created on 2022-08-27 with reprex v2.0.2
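If you would rather pass var as a string too, as in the question, here is a minimal sketch of just the summary step. It swaps rlang::parse_exprs() and {{ }} for dplyr's across(all_of()) and the .data pronoun. The name grp_summary is hypothetical, and the sketch assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical helper: groupby is a character vector of column names,
# var is a single column name given as a string
grp_summary <- function(data, groupby, var) {
  data %>%
    filter(!is.na(.data[[var]])) %>%
    group_by(across(all_of(groupby))) %>%
    summarise(n = n(),
              mean = mean(.data[[var]]),
              sd = sd(.data[[var]]),
              se = sd / sqrt(n),
              .groups = "drop")
}

res <- grp_summary(mtcars, c("gear", "cyl"), "mpg")
print(res)
```

The same .data[[groupby[1]]] lookup also works inside aes(), so the plot step could take strings in the same way.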