Search code examples
rstatisticsanovaposthoctukey

Looking to filter and/or tidy up TukeyHSD pairwise comparisons for two-way ANOVA


I'm running a 2-way ANOVA to compare how two variables affect the price of a service. However, when I run post-hoc pairwise comparisons I would like to only see rows that are comparable (i.e. when one of the independent variables matches). Here is an example to show the issue:

Variable 1 - Species: tuna, halibut, salmon, flounder

Variable 2 - Sex: Male, female

Dependent variable: mass

Here is what I'm using for 2-way ANOVA and pairwise comparisons:

> dat <- read.csv(<file location>)
> aov2 <- aov(mass ~ species * sex, data = dat)
> TukeyHSD(aov2, which = "species:sex")

When I run the pairwise comparisons of mass using TukeyHSD, I would only like to see ones such as Tuna:Male - Tuna:Female or Halibut:Male - Flounder:Male, but I would not like to see Flounder:Female - Halibut:Male. Basically, I just want to see comparisons where one of the variables has the same value.

Honestly am not even sure what to try, or if this is even possible. I had some NA values for some of the comparisons and I tried using:

TukeyHSD(aov2, which = "species:sex") %>% filter(is.na(diff))

but I get the following:

Error in UseMethod("filter") : 
no applicable method for 'filter' applied to an object of class "c('TukeyHSD', 'multicomp')"

If this is not possible, is there a way to filter data from a dataframe before it goes into 1-way ANOVA, for example by a certain value in one of the columns? Then I can try to run multiple 1-way ANOVAs to accomplish a similar goal. I would rather not have to create separate data files because the actual dataset I'm working with is very large and would be time-consuming to manually separate.


Solution

  • This is not possible with the current TukeyHSD function. However, with some modifications of this function, you can get the desired output. Below is the modified function, which adds a new argument called "match", the default being FALSE:

    TukeyHSD2 <- function (x, which = seq_along(tabs), ordered = FALSE, conf.level = 0.95, 
              match=FALSE, ...) 
    {
      mm <- model.tables(x, "means")
      if (is.null(mm$n)) 
        stop("no factors in the fitted model")
      tabs <- mm$tables
      if (names(tabs)[1L] == "Grand mean") 
        tabs <- tabs[-1L]
      tabs <- tabs[which]
      nn <- mm$n[names(tabs)]
      nn_na <- is.na(nn)
      if (all(nn_na)) 
        stop("'which' specified no factors")
      if (any(nn_na)) {
        warning("'which' specified some non-factors which will be dropped")
        tabs <- tabs[!nn_na]
        nn <- nn[!nn_na]
      }
      out <- setNames(vector("list", length(tabs)), names(tabs))
      MSE <- sum(x$residuals^2)/x$df.residual
      for (nm in names(tabs)) {
        tab <- tabs[[nm]]
        means <- as.vector(tab)
        nms <- if (length(dim(tab)) > 1L) {
          dn <- dimnames(tab)
          apply(do.call("expand.grid", dn), 1L, paste, collapse = ":")
        }
        else names(tab)
        n <- nn[[nm]]
        if (length(n) < length(means)) 
          n <- rep.int(n, length(means))
        if (as.logical(ordered)) {
          ord <- order(means)
          means <- means[ord]
          n <- n[ord]
          if (!is.null(nms)) 
            nms <- nms[ord]
        }
        
        center <- outer(means, means, `-`)
        keep <- lower.tri(center)
        
        dnames <- list(outer(nms, nms, paste, sep = "-"), c("diff", "lwr", "upr", "p adj"))
        dk1 <- dnames[[1L]][keep]
        
        if(match) {
          
          keep2 <- sapply(1:length(dk), function(x) {
            pattern <- '(.+):(.+)-(.+):(.+)'
            d1 <- sub(pattern, '\\1', dk1[x])
            d2 <- sub(pattern, '\\2', dk1[x])
            d3 <- sub(pattern, '\\3', dk1[x])
            d4 <- sub(pattern, '\\4', dk1[x])
            d1==d3 | d2==d4 } )
          
          dnames[[1L]] <- dk1[keep2]
        } else dnames[[1L]] <- dk1
    
        center <- center[keep]
        width <- qtukey(conf.level, length(means), x$df.residual) * 
          sqrt((MSE/2) * outer(1/n, 1/n, `+`))[keep]
        est <- center/(sqrt((MSE/2) * outer(1/n, 1/n, `+`))[keep])
        
        if(match) {
          center <- center[keep2]
          width <- width[keep2]
          est <- est[keep2]
        }
        
        pvals <- ptukey(abs(est), length(means), x$df.residual, 
                        lower.tail = FALSE)
        out[[nm]] <- array(c(center, center - width, center + 
                            width, pvals), c(length(width), 4L), dnames)
      }
      class(out) <- c("TukeyHSD", "multicomp")
      attr(out, "orig.call") <- x$call
      attr(out, "conf.level") <- conf.level
      attr(out, "ordered") <- ordered
      out
    }
    

    Send the above function to your R console and then test it on your data. Here is the output using the "warpbreaks" example dataset.

    fm1 <- aov(breaks ~ wool * tension, data = warpbreaks)
    
    > TukeyHSD(fm1, which="wool:tension")
    
      Tukey multiple comparisons of means
        95% family-wise confidence level
    
    Fit: aov(formula = breaks ~ wool * tension, data = warpbreaks)
    
    $`wool:tension`
                   diff       lwr        upr     p adj
    B:L-A:L -16.3333333 -31.63966  -1.027012 0.0302143
    A:M-A:L -20.5555556 -35.86188  -5.249234 0.0029580
    B:M-A:L -15.7777778 -31.08410  -0.471456 0.0398172
    A:H-A:L -20.0000000 -35.30632  -4.693678 0.0040955
    B:H-A:L -25.7777778 -41.08410 -10.471456 0.0001136
    A:M-B:L  -4.2222222 -19.52854  11.084100 0.9626541
    B:M-B:L   0.5555556 -14.75077  15.861877 0.9999978
    A:H-B:L  -3.6666667 -18.97299  11.639655 0.9797123
    B:H-B:L  -9.4444444 -24.75077   5.861877 0.4560950
    B:M-A:M   4.7777778 -10.52854  20.084100 0.9377205
    A:H-A:M   0.5555556 -14.75077  15.861877 0.9999978
    B:H-A:M  -5.2222222 -20.52854  10.084100 0.9114780
    A:H-B:M  -4.2222222 -19.52854  11.084100 0.9626541
    B:H-B:M -10.0000000 -25.30632   5.306322 0.3918767
    B:H-A:H  -5.7777778 -21.08410   9.528544 0.8705572
    

    And here is the output from the modified version with "match" set to TRUE.

    > TukeyHSD2(fm1, which="wool:tension", match=TRUE)
    
      Tukey multiple comparisons of means
        95% family-wise confidence level
    
    Fit: aov(formula = breaks ~ wool * tension, data = warpbreaks)
    
    $`wool:tension`
                   diff       lwr       upr     p adj
    B:L-A:L -16.3333333 -31.63966 -1.027012 0.0302143
    A:M-A:L -20.5555556 -35.86188 -5.249234 0.0029580
    A:H-A:L -20.0000000 -35.30632 -4.693678 0.0040955
    B:M-B:L   0.5555556 -14.75077 15.861877 0.9999978
    B:H-B:L  -9.4444444 -24.75077  5.861877 0.4560950
    B:M-A:M   4.7777778 -10.52854 20.084100 0.9377205
    A:H-A:M   0.5555556 -14.75077 15.861877 0.9999978
    B:H-B:M -10.0000000 -25.30632  5.306322 0.3918767
    B:H-A:H  -5.7777778 -21.08410  9.528544 0.8705572