
get location of row with median value in R data frame


I am a bit stuck on this basic problem and cannot find a solution.

I have two data frames (dummies below):

x <- data.frame(Col1 = c(1, 2, 3, 4), Col2 = c(3, 3, 6, 3))
y <- data.frame(ColA = c(0, 0, 9, 4), ColB = c(5, 3, 20, 3))

I need to use the location of the median value of one column in df x to retrieve a value from df y. To do this, I am trying to get the row number of the median value in, e.g., x$Col1, and then retrieve the value using something like y[,"ColB"][row.number]

Is there an elegant way/function for doing this? Solutions might need to account for two cases: when the sample has an even number of values, and when it has an odd number. (With an even number of values, the median might not be found in the sample, because it is calculated as the mean of the two middle values.)


Solution

  • The problem is a little underspecified.

    • What should happen when the median isn't in the data?
    • What should happen if the median appears in the data multiple times?

    Here's a solution which takes the (absolute) difference between each value and the median, then returns the index of the first row for which that difference vector achieves its minimum.

    with(x, which.min(abs(Col1 - median(Col1))))
    # [1] 2
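
    To tie this back to the question, here is a minimal sketch that uses that row number to pull the matching value from y. The data frames are the dummies from the question; the variable name row_number is only illustrative.

    ```r
    # Dummy data from the question
    x <- data.frame(Col1 = c(1, 2, 3, 4), Col2 = c(3, 3, 6, 3))
    y <- data.frame(ColA = c(0, 0, 9, 4), ColB = c(5, 3, 20, 3))

    # Row of x whose Col1 value is closest to the median (first such row on ties)
    row_number <- with(x, which.min(abs(Col1 - median(Col1))))

    # Value in the same row of y
    y_value <- y[row_number, "ColB"]
    y_value
    # [1] 3
    ```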
    

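    The quantile function with type = 1 (i.e. no averaging) may also be of interest, depending on your desired behavior. With an even number of values it returns the lower of the two middle values, which is always an element of the data. The which.min method above instead picks whichever of the closest values comes first, so its result can depend on the ordering of your data.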

    quantile(x$Col1, .5, type = 1)
    # 50% 
    #   2 
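
    As a rough illustration of that ordering dependence, the sketch below runs both approaches on the same four values in ascending and descending order. The vectors col_asc and col_desc are made up for this example and are not part of the question's data.

    ```r
    # Same values, two orderings (hypothetical example vectors)
    col_asc  <- c(1, 2, 3, 4)
    col_desc <- c(4, 3, 2, 1)

    # which.min picks the first of the two values closest to the median (2.5)
    picked_asc  <- col_asc[which.min(abs(col_asc - median(col_asc)))]
    picked_desc <- col_desc[which.min(abs(col_desc - median(col_desc)))]
    picked_asc   # 2, the lower middle value comes first
    picked_desc  # 3, the upper middle value comes first

    # quantile with type = 1 gives the lower middle value for either ordering
    q_asc  <- unname(quantile(col_asc, 0.5, type = 1))
    q_desc <- unname(quantile(col_desc, 0.5, type = 1))
    ```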
    

    To get the row number rather than the value, one option using quantile is:

    with(x, which(Col1 == quantile(Col1, .5, type = 1)))
    # [1] 2
    

    This can return multiple row numbers if the matched value appears more than once in the column.
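
    For instance, Col2 in the question's data contains the value 3 three times, so the same expression applied to Col2 matches several rows. This is only a sketch; the names all_rows and first_row are illustrative.

    ```r
    # Dummy data from the question
    x <- data.frame(Col1 = c(1, 2, 3, 4), Col2 = c(3, 3, 6, 3))

    # Every row whose Col2 equals the type-1 median (3 for this data)
    all_rows <- with(x, which(Col2 == quantile(Col2, 0.5, type = 1)))
    all_rows
    # [1] 1 2 4

    # Keep only the first matching row
    first_row <- all_rows[1]
    ```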

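    Edit: If you want only the first match, you could modify it as shown below (which.min on a logical vector returns the position of the first FALSE, i.e. the first row where the values are equal):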

    with(x, which.min(Col1 != quantile(Col1, .5, type = 1)))
    # [1] 2