Search code examples
algorithmfunctionrtestinglinear-regression

John Tukey "median median" (or "resistant line") statistical test for R and linear regression


I'm searching the John Tukey algorithm which compute a "resistant line" or "median-median line" on my linear regression with R.

A student on a mailling list explain this algorithm in these terms :

"The way it's calculated is to divide the data into three groups, find the x-median and y-median values (called the summary point) for each group, and then use those three summary points to determine the line. The outer two summary points determine the slope, and an average of all of them determines the intercept."

Article about John tukey's median median for curious : http://www.johndcook.com/blog/2009/06/23/tukey-median-ninther/

Do you have an idea of where i could find this algorithm or R function ? In which packages, Thanks a lot !


Solution

  • There's a description of how to calculate the median-median line here. An R implementation of that is

    median_median_line <- function(x, y, data)
    {
      if(!missing(data))
      {
        x <- eval(substitute(x), data) 
        y <- eval(substitute(y), data) 
      }
      
      stopifnot(length(x) == length(y))
    
      #Step 1
      one_third_length <- floor(length(x) / 3)
      groups <- rep(1:3, times = switch((length(x) %% 3) + 1,
         one_third_length,
         c(one_third_length, one_third_length + 1, one_third_length),
         c(one_third_length + 1, one_third_length, one_third_length + 1)
      ))
    
      #Step 2
      x <- sort(x)
      y <- sort(y)
      
      #Step 3
      median_x <- tapply(x, groups, median)                                 
      median_y <- tapply(y, groups, median)
    
      #Step 4
      slope <- (median_y[3] - median_y[1]) / (median_x[3] - median_x[1])
      intercept <- median_y[1] - slope * median_x[1]
    
      #Step 5
      middle_prediction <- intercept + slope * median_x[2]
      intercept <- intercept + (median_y[2] - middle_prediction) / 3
      c(intercept = unname(intercept), slope = unname(slope))
    }
    

    To test it, here's an example:

    dfr <- data.frame(
      time = c(.16, .24, .25, .30, .30, .32, .36, .36, .50, .50, .57, .61, .61, .68, .72, .72, .83, .88, .89),
      distance = c(12.1, 29.8, 32.7, 42.8, 44.2, 55.8, 63.5, 65.1, 124.6, 129.7, 150.2, 182.2, 189.4, 220.4, 250.4, 261.0, 334.5, 375.5, 399.1))
      
    median_median_line(time, distance, dfr) 
    #intercept     slope 
    #   -113.6     520.0
    

    Note the slightly odd way of specifying the groups. The instructions are quite picky about how you define group sizes, so the more obvious method of cut(x, quantile(x, seq.int(0, 1, 1/3))) doesn't work.