Combining medcouple in ggplot's boxplot

ggplot's boxplot render any dot outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] as an outlier. As discussed in the R blog, this is not always a good idea, as false outliers may be reported due to asymmetric distributions. In that case, adjbox, from the robustbase R package, is often suggested. However, the quality of the plot is not as good as ggplot2.

My question is: does anybody know how to combine the exponential model used in adjbox to be used by ggplot when detecting outliers?

Solution

You're probably going to have to supply the values yourself. Referencing the example used in the adjbox() help file:

library(robustbase)
if(require("boot")) {
  ### Hubert and Vandervieren (2006), p. 10, Fig. 4.
  data(coal, package = "boot")
  coaldiff <- diff(coal$date)
  op <- par(mfrow = c(1,2))
  boxplot(coaldiff, main = "Original Boxplot")
  adjbox(coaldiff, main  = "Adjusted Boxplot")
  par(op)
}

This produces the following boxplots:

enter image description here

Then you can get the values that the adjbox() function used:

coald <- data.frame(coaldiff = diff(coal$date))  #$
adjboxStats(coald$coaldiff)$stats
# [1] 0.0000000 0.1013005 0.3107461 0.7529090 3.7180014

These are the values used to draw the adjusted boxplot. You can use supply this information to ggplot() and then calculate your own boxplots. There are probably better ways of doing this, but the first thing that I thought of doing was to make a new data set including the adjusted boxplot values:

library(ggplot2)
library(plyr)
d <- ddply(coald, .(coaldiff), transform,
  ymin = adjboxStats(coald$coaldiff)$stats[1],
  ymax = adjboxStats(coald$coaldiff)$stats[5],
  middle = adjboxStats(coald$coaldiff)$stats[3],
  lower = adjboxStats(coald$coaldiff)$stats[2],
  upper = adjboxStats(coald$coaldiff)$stats[4])

# Boxplot with unadjusted values:
p <- ggplot(d, aes(factor(1), coaldiff)) 
p + geom_boxplot()

# Boxplot with adjusted values (note that you have to add the outliers back in):
p + geom_boxplot(aes(ymin=ymin, ymax=ymax, middle=middle, upper=upper, lower=lower),
  stat="identity") + 
  geom_point(data=subset(d, coaldiff < ymin | coaldiff > ymax))

This will give you ggplot2 versions of the above plots:

enter image description here

Also note that Hadley Wickham advises against this "without a lot of thought" in a response to a similar question.