
Parsing and using a string as an argument in a plot inside a user-defined function


I'm trying to build a function that receives a data frame (data), one or two variables to group by (groupby), and the name of a dependent variable (var). The function should then create a plot of the means of var, separated by the group(s) in groupby. A nice-to-have would be an ANOVA at the end.

I'll start with the end: my problem is how to use (string) values in further manipulations inside a user-defined function.

Unfortunately, I have problems parsing groupby that I couldn't solve after a couple of days of trying. I tried !!!rlang::parse_exprs, strsplit, etc., but with no success. Currently it looks something like this (a simplified version with fewer aesthetics):

grp_comp <- function(data, groupby, var){
  data %>%
    filter(!is.na(var)) %>%
    group_by(!!!rlang::parse_exprs(groupby)) %>%
    summarize(n = n(),
              mean = mean(!!!rlang::parse_expr(var)),
              sd = sd(!!!rlang::parse_expr(var)),
              se = sd / sqrt(n)) -> ddata
  gg <- unlist(rlang::parse_exprs(groupby))
    if(length(as.vector(rlang::parse_exprs(groupby))) == 1){
    g <- ggplot(ddata, aes(x = as.character(gg[1]), 
                            y = mean)) + 
      geom_point()}
  else{ 
    g <- ggplot(ddata, aes(x = as.character(gg[1]), 
                          y = mean, 
                          shape = as.character(gg[2]), 
                          color= as.character(gg[2])),
                group = as.character(gg[2]))}
  form <- unlist(strsplit(groupby, ';', fixed = T)) 
  form <- paste(form, collapse = " + ")
  form <- paste(var, " ~ ", form)
  form
    data%>%
    filter(!is.na(var)) %>%
    aov(formula = form) -> anova
  summary(anova) -> anova
  l <- list(ddata, g, anova)
  l
  }

My problems are:

a. groupby can contain one or two variables. I can't manage to use groupby for the grouping in the ggplots. Either I get Error: Discrete value supplied to continuous scale when I use x = gg[1], or, when I use x = as.factor(gg[1]) or as.character, I get a plot in which x is only labeled "BPL" but is not grouped by the factor.

b. When I try to use two (instead of one) groupby factors, things get even worse and the plot is completely empty.

c. I manage to create the right formula for the ANOVA, but when I try to actually calculate it I get Error: $ operator is invalid for atomic vectors. Any ideas why?

d. Not critical, but any ideas for using the second, optional group as color and shape in aes() when the argument contains two groups, without using the if?

Many many thanks in advance!

Guy


Solution

  • It's not clear how you want to call this function, but you could do something like:

    library(tidyverse)
    
    grp_comp <- function(data, groupby, var){
      # Per-group n, mean, sd and standard error of `var` (passed unquoted)
      ddata <- data %>%
        filter(!is.na({{var}})) %>%
        group_by(!!!rlang::parse_exprs(groupby)) %>%
        summarize(n = n(),
                  mean = mean({{var}}),
                  sd = sd({{var}}),
                  se = sd / sqrt(n))
    
      # Grouping variables as a list of symbols, unquoted with !! inside aes()
      gg <- unlist(rlang::parse_exprs(groupby))
      
      g <- if(length(gg) == 1) 
             ggplot(ddata, aes(x = !!gg[[1]], y = mean)) + geom_point()
           else {
             ggplot(ddata, aes(x = !!gg[[1]], y = mean, shape = factor(!!gg[[2]]), 
                               color= !!gg[[2]], group = !!gg[[2]])) + geom_point()
           }
      
      # Build the model formula as a string, e.g. "mpg  ~  gear + cyl"
      form <- unlist(strsplit(groupby, ';', fixed = TRUE)) 
      form <- paste(form, collapse = " + ")
      form <- paste(deparse(substitute(var)), " ~ ", form)
    
      # aov() needs a formula object, so convert the string with as.formula()
      anova <- data %>%
        filter(!is.na({{var}})) %>%
        aov(formula = as.formula(form))
      anova <- summary(anova)
      list(ddata, g, anova)
    }
    

    This allows:

    grp_comp(iris, "Species", Sepal.Length)
    #> [[1]]
    #> # A tibble: 3 x 5
    #>   Species        n  mean    sd     se
    #>   <fct>      <int> <dbl> <dbl>  <dbl>
    #> 1 setosa        50  5.01 0.352 0.0498
    #> 2 versicolor    50  5.94 0.516 0.0730
    #> 3 virginica     50  6.59 0.636 0.0899
    #> 
    #> [[2]]
    #> 
    #> [[3]]
    #>              Df Sum Sq Mean Sq F value Pr(>F)    
    #> Species       2  63.21  31.606   119.3 <2e-16 ***
    #> Residuals   147  38.96   0.265                   
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    

    And

    grp_comp(mtcars, c("gear", "cyl"), mpg)
    #> `summarise()` has grouped output by 'gear'. You can override using the
    #> `.groups` argument.
    #> [[1]]
    #> # A tibble: 8 x 6
    #> # Groups:   gear [3]
    #>    gear   cyl     n  mean     sd     se
    #>   <dbl> <dbl> <int> <dbl>  <dbl>  <dbl>
    #> 1     3     4     1  21.5 NA     NA    
    #> 2     3     6     2  19.8  2.33   1.65 
    #> 3     3     8    12  15.0  2.77   0.801
    #> 4     4     4     8  26.9  4.81   1.70 
    #> 5     4     6     4  19.8  1.55   0.776
    #> 6     5     4     2  28.2  3.11   2.2  
    #> 7     5     6     1  19.7 NA     NA    
    #> 8     5     8     2  15.4  0.566  0.400
    #> 
    #> [[2]]
    #> 
    #> [[3]]
    #>             Df Sum Sq Mean Sq F value   Pr(>F)    
    #> gear         1  259.7   259.7   24.87 2.63e-05 ***
    #> cyl          1  563.4   563.4   53.94 4.32e-08 ***
    #> Residuals   29  302.9    10.4                     
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    

    Created on 2022-08-27 with reprex v2.0.2
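
If all three arguments should be strings, as in the original question, one possible variant is sketched below. It is not part of the answer above: the name grp_comp_chr, the named list it returns, and the .groups = "drop" setting are assumptions made for illustration. The sketch turns the strings into symbols with rlang::sym() and rlang::syms() before unquoting them, because as.character(gg[1]) inside aes() maps a constant string rather than a column, which would account for the single "BPL" label in (a). It builds the model formula with reformulate(), a base R alternative to as.formula(paste(...)), so that aov() receives a formula object instead of a character string. For (d), the optional second group is added through lapply() over the remaining symbols, so no if branch is needed; this assumes groupby holds one or two column names.

```r
library(dplyr)
library(ggplot2)

# Hypothetical variant (not from the answer above): every argument is a string.
grp_comp_chr <- function(data, groupby, var) {
  group_syms <- rlang::syms(groupby)  # list of symbols, one per grouping column
  var_sym <- rlang::sym(var)          # symbol for the dependent variable

  # Rows with a non-missing response, reused for the summary and the ANOVA
  complete <- filter(data, !is.na(!!var_sym))

  ddata <- complete %>%
    group_by(!!!group_syms) %>%
    summarize(n = n(),
              mean = mean(!!var_sym),
              sd = sd(!!var_sym),
              se = sd / sqrt(n),
              .groups = "drop")  # assumption: an ungrouped result is acceptable

  # Zero or one extra mapping: lapply() over an empty list returns an empty
  # list, and adding an empty list to a plot changes nothing, so no if/else.
  extra <- lapply(group_syms[-1], function(s) {
    aes(colour = factor(!!s), shape = factor(!!s), group = !!s)
  })

  g <- ggplot(ddata, aes(x = !!group_syms[[1]], y = mean)) +
    extra +
    geom_point()

  # reformulate() returns a formula object, e.g. mpg ~ gear + cyl
  form <- reformulate(groupby, response = var)
  anova <- summary(aov(form, data = complete))

  list(summary = ddata, plot = g, anova = anova)
}

res1 <- grp_comp_chr(iris, "Species", "Sepal.Length")
res2 <- grp_comp_chr(mtcars, c("gear", "cyl"), "mpg")
```

Here groupby is a character vector such as c("gear", "cyl") rather than a single ';'-separated string. Printing res2$plot draws the points, and res2$anova holds the ANOVA summary.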