Search code examples
rdataframeuser-defined-functionsrlangemmeans

Learning user defined functions to do ANOVA and emmeans pairwise comparison


I am trying to learn to write functions and exploring making a function to do an ANOVA and post F test. I have simplified this to the problem which is obtaining emmeans and associated all pairwise comparisons. The issue is the pairwise comparisons do not work??? I am real new at functions and have tried a lot of ways to fix this to no avail...

I am using the diamonds data set for reproducibility...

Here is the code:

# libraries
library(tidyverse)
library(car)
library(emmeans)
library(readxl)
library(rlang)
library(broom)

dia <- diamonds
dia$cut <- as.factor(as.character(dia$cut))

results_LM <- function(data, iv, dv) {
  iv_name <- as.character(substitute(iv))
  dv_name <- as.character(substitute(dv))
  formula_str <- paste(dv_name, "~", iv_name)
  aov.model <- lm(formula_str, data = data)
  emm.means <- emmeans(aov.model,
                       specs = iv_name,
                       by = iv_name)
  emm.pairs <- pairs(emm.means)
  output <- list(
    aov_model_summary = summary(aov.model),
    emm.means,
    emm.pairs)
  return(output)
}

results_LM(dia, cut, price)

This issue is the emmeans output looks odd and the pairwise comparisons are missing

[[3]]
cut = Fair:
 contrast  estimate SE df z.ratio p.value
 (nothing)   nonEst NA NA      NA      NA

cut = Good:
 contrast  estimate SE df z.ratio p.value
 (nothing)   nonEst NA NA      NA      NA

cut = Ideal:
 contrast  estimate SE df z.ratio p.value
 (nothing)   nonEst NA NA      NA      NA

cut = Premium:
 contrast  estimate SE df z.ratio p.value
 (nothing)   nonEst NA NA      NA      NA

cut = Very Good:
 contrast  estimate SE df z.ratio p.value
 (nothing)   nonEst NA NA      NA      NA```

Note that when I do the anova and emmeans post F test outside of the function it works fine? 

Any help is appreciated and pointers to learn to write functions like this on dataframes would be GREAT. Thanks Bill

Solution

  • You were really close. I think the only thing is you don't need the by argument in the call to emmeans(). I might also suggest using reformulate() to make the formula. See below:

    library(tidyverse)
    library(emmeans)
    
    dia <- diamonds
    dia$cut <- as.factor(as.character(dia$cut))
    
    results_LM <- function(data, iv, dv) {
      iv_name <- as.character(substitute(iv))
      dv_name <- as.character(substitute(dv))
      formula_str <- reformulate(iv_name, response=dv_name)
      aov.model <- lm(formula_str, data = data)
      emm.means <- emmeans(aov.model,
                           specs = iv_name)
      emm.pairs <- pairs(emm.means)
      output <- list(
        aov_model_summary = summary(aov.model),
        emm.means,
        emm.pairs)
      names(output) <- c("Model", "Marginal Means", "Pairwise Comparisons")
      return(output)
    }
    
    results_LM(dia, cut, price)
    #> $Model
    #> 
    #> Call:
    #> lm(formula = formula_str, data = data)
    #> 
    #> Residuals:
    #>    Min     1Q Median     3Q    Max 
    #>  -4258  -2741  -1494   1360  15348 
    #> 
    #> Coefficients:
    #>              Estimate Std. Error t value Pr(>|t|)    
    #> (Intercept)   4358.76      98.79  44.122  < 2e-16 ***
    #> cutGood       -429.89     113.85  -3.776 0.000160 ***
    #> cutIdeal      -901.22     102.41  -8.800  < 2e-16 ***
    #> cutPremium     225.50     104.40   2.160 0.030772 *  
    #> cutVery Good  -377.00     105.16  -3.585 0.000338 ***
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> Residual standard error: 3964 on 53935 degrees of freedom
    #> Multiple R-squared:  0.01286,    Adjusted R-squared:  0.01279 
    #> F-statistic: 175.7 on 4 and 53935 DF,  p-value: < 2.2e-16
    #> 
    #> 
    #> $`Marginal Means`
    #>  cut       emmean   SE    df lower.CL upper.CL
    #>  Fair        4359 98.8 53935     4165     4552
    #>  Good        3929 56.6 53935     3818     4040
    #>  Ideal       3458 27.0 53935     3405     3510
    #>  Premium     4584 33.8 53935     4518     4650
    #>  Very Good   3982 36.1 53935     3911     4052
    #> 
    #> Confidence level used: 0.95 
    #> 
    #> $`Pairwise Comparisons`
    #>  contrast            estimate    SE    df t.ratio p.value
    #>  Fair - Good            429.9 113.8 53935   3.776  0.0015
    #>  Fair - Ideal           901.2 102.4 53935   8.800  <.0001
    #>  Fair - Premium        -225.5 104.4 53935  -2.160  0.1950
    #>  Fair - Very Good       377.0 105.2 53935   3.585  0.0031
    #>  Good - Ideal           471.3  62.7 53935   7.517  <.0001
    #>  Good - Premium        -655.4  65.9 53935  -9.946  <.0001
    #>  Good - Very Good       -52.9  67.1 53935  -0.788  0.9341
    #>  Ideal - Premium      -1126.7  43.2 53935 -26.067  <.0001
    #>  Ideal - Very Good     -524.2  45.1 53935 -11.636  <.0001
    #>  Premium - Very Good    602.5  49.4 53935  12.198  <.0001
    #> 
    #> P value adjustment: tukey method for comparing a family of 5 estimates
    

    If you wanted the rlang evaluation instead, you could replace as.character(substitute()) with as_label(enquo()).

    Since you're trying to learn how functions work and what you can do with them, let me demonstrate one other thing. You could give your returned result a class (e.g., "lm_pwc"), by setting the class of the output list with class(output) <- "lm_pwc".

    library(tidyverse)
    library(emmeans)
    
    dia <- diamonds
    dia$cut <- as.factor(as.character(dia$cut))
    
    results_LM <- function(data, iv, dv) {
      iv_name <- as_label(enquo(iv))
      dv_name <- as_label(enquo(dv))
      formula_str <- reformulate(iv_name, response=dv_name)
      aov.model <- lm(formula_str, data = data)
      emm.means <- emmeans(aov.model,
                           specs = iv_name)
      emm.pairs <- pairs(emm.means)
      output <- list(
        aov_model_summary = summary(aov.model),
        means = emm.means,
        pairs = emm.pairs)
      class(output) <- "lm_pwc"
      return(output)
    }
    

    Then, you could make a separate print.lm_pwc() function that would allow you to print only the desired part of the results. For example, the function below by default prints only the pairwise comparisons, but the returned result contains all three elements in the initial return. This allows you to return more objects than you print. Here's an example:

    
    print.lm_pwc <- function(x, ..., which=c("pairs", "model", "means", "all")){
      w <- match.arg(which, several.ok=FALSE)
      if(w %in% c("model", "all")){
        cat("Model Results:\n")
        print(x[[1]])
      }
      if(w %in% c("means", "all")){
        cat("Marginal Means:\n")
        print(x[[2]])
      }
      if(w %in% c("pairs", "all")){
        cat("Pairwise Comparisons:\n")
        print(x[[3]])
      }
    }
    
    res <- results_LM(dia, cut, price)
    res
    #> Pairwise Comparisons:
    #>  contrast            estimate    SE    df t.ratio p.value
    #>  Fair - Good            429.9 113.8 53935   3.776  0.0015
    #>  Fair - Ideal           901.2 102.4 53935   8.800  <.0001
    #>  Fair - Premium        -225.5 104.4 53935  -2.160  0.1950
    #>  Fair - Very Good       377.0 105.2 53935   3.585  0.0031
    #>  Good - Ideal           471.3  62.7 53935   7.517  <.0001
    #>  Good - Premium        -655.4  65.9 53935  -9.946  <.0001
    #>  Good - Very Good       -52.9  67.1 53935  -0.788  0.9341
    #>  Ideal - Premium      -1126.7  43.2 53935 -26.067  <.0001
    #>  Ideal - Very Good     -524.2  45.1 53935 -11.636  <.0001
    #>  Premium - Very Good    602.5  49.4 53935  12.198  <.0001
    #> 
    #> P value adjustment: tukey method for comparing a family of 5 estimates
    

    Notice that the above only prints the pairwise comparisons, but the returned object contains all three elements.

    names(res)
    #> [1] "aov_model_summary" "means"             "pairs"
    

    You could print just the model with print(res, which = "model") or the marginal means with print(res, which="means") or you could show all results with print(res, which="all"). Note, that the first choice in the function definition (e.g., "pairs" above) is the default.

    Created on 2023-08-09 with reprex v2.0.2