
Median Values in R - Returns Rounded Number


I have a table of data where I've labeled each row with the cluster it falls into and calculated the average of each row's column values. I would like to select the median row for each cluster.

For example's sake, looking at just one cluster, I would like to use:

    median(as.numeric(as.vector(subset(df,df$cluster == i )$avg))) 

I can see that

    > as.numeric(as.vector(subset(df,df$cluster == i )$avg))
     [1] 48.11111111 47.77777778 49.44444444 49.33333333 47.55555556 46.55555556 47.44444444 47.11111111 45.66666667 45.44444444

And yet, the median is

    > median(as.numeric(as.vector(subset(df,df$cluster == i )$avg)))
    [1] 47.5

I would like to find the median record by matching the returned median against the avg column, but that isn't possible here because 47.5 doesn't appear in the data.

I've found some documentation and questions on rounding with the mean function, but unfortunately that doesn't seem to apply here.

I could also limit the number of decimal places in the data, but some records are so close together that duplicates would be common if rounded to one decimal place.


Solution

  • When the input has an even number of values (like your 10 values), no single value sits directly in the middle. The standard definition of the median, which R implements, averages the two middle values in that case; here those are 47.44444444 and 47.55555556, which average to 47.5. Nothing is being rounded. To get an actual record, you could rank the data and, for an even-length input, select either the n/2 or the n/2 + 1 record.

    So, if your data were x = c(8, 6, 7, 5), the median is 6.5. You seem to want the index of "the median", which would be either 2 (the value 6) or 3 (the value 7).

    If we assume there are no ties, then we can get these answers with

    which(rank(x) == length(x) / 2)
    # [1] 2
    which(rank(x) == length(x) / 2 + 1)
    # [1] 3
    

    If ties are a possibility, then rank's default tie-breaking method will cause you some problems. Have a look at ?rank and figure out which option you'd like to use.

    We can, of course, turn this into a little utility function:

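    # Return the position in x of a "median" element.
    median_index = function(x) {
      lx = length(x)
      # Odd length: the median is an actual element of x, so look it up directly.
      if (lx %% 2 == 1) {
        return(match(median(x), x))
      }
      # Even length: no single middle element exists, so take the upper of the
      # two middle values; ties.method = "first" keeps every rank a distinct integer.
      which(rank(x, ties.method = "first") == lx/2 + 1)
    }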
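
As a rough sketch of how this might be applied to the original problem, the helper can be run once per cluster to pull out one "median" row each. The data frame below is made up for illustration: the column names cluster and avg follow the question, while the id column and all values are assumptions. median_index is repeated so the block runs on its own. The last few lines show why the default tie handling in rank can cause trouble.

```r
# Helper from above, repeated so this block is self-contained.
median_index = function(x) {
  lx = length(x)
  if (lx %% 2 == 1) {
    return(match(median(x), x))
  }
  which(rank(x, ties.method = "first") == lx / 2 + 1)
}

# Hypothetical data: cluster 1 has an even number of rows, cluster 2 an odd number.
df = data.frame(
  id      = 1:7,
  cluster = c(1, 1, 1, 1, 2, 2, 2),
  avg     = c(8, 6, 7, 5, 3, 9, 4)
)

# Split by cluster, pick the median row within each piece, and recombine.
median_rows = do.call(
  rbind,
  lapply(split(df, df$cluster), function(d) d[median_index(d$avg), ])
)
print(median_rows)

# With the default ties.method = "average", tied values share a fractional
# rank, so an exact comparison against an integer rank can match nothing.
ties_default = which(rank(c(5, 6, 6, 8)) == 2)
ties_first   = which(rank(c(5, 6, 6, 8), ties.method = "first") == 2)
print(ties_default)
print(ties_first)
```

For the even-sized cluster this picks the upper of the two middle rows; comparing against lx / 2 instead would pick the lower one. Which of the two is preferable depends on the use case.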