Search code examples
rstatisticsdistribution

How to find quantiles of an empirical cumulative density function (ECDF)


I am using ecdf() function to calculate empirical cumulative density function (ECDF) from some random samples:

set.seed(0)
X = rnorm(100)
P = ecdf(X)

Now P gives ECDF and we could plot it:

plot(P)
abline(h = 0.6, lty = 3)

ecdf

My question is: how can I find the sample value x, such that P(x) = 0.6, i.e., the 0.6-quantile of ECDF, or the x-coordinate of the intersection point of ECDF and h = 0.6?


Solution

  • In the following, I will not use ecdf(), as it is easy to obtain empirical cumulative density function (ECDF) ourselves.

    First, we sort samples X in ascending order:

    X <- sort(X)
    

    ECDF at those samples takes function values:

    e_cdf <- 1:length(X) / length(X)
    

    We could then sketch ECDF by:

    plot(X, e_cdf, type = "s")
    abline(h = 0.6, lty = 3)
    

    enter image description here

    Now, we are looking for the first value of X, such that P(X) >= 0.6. This is just:

    X[which(e_cdf >= 0.6)[1]]
    # [1] 0.2290196
    

    Since our data are sampled from standard normal distribution, the theoretical quantile is

    qnorm(0.6)
    # [1] 0.2533471
    

    So our result is quite close.


    Extension

    Since the inverse of CDF is quantile function (for example, the inverse of pnorm() is qnorm()), one may guess the inverse of ECDF as sample quantile, i,e, the inverse ecdf() is quantile(). This is not true!

    ECDF is a staircase / step function, and it does not have inverse. If we rotate ECDF around y = x, the resulting curve is not a mathematical function. So sample quantile is has nothing to do with ECDF.

    For n sorted samples, sample quantile function is in fact a linear interpolation function of (x, y), with:

    • x-values being seq(0, 1, length = n);
    • y-values being sorted samples.

    We can define our own version of sample quantile function by:

    my_quantile <- function(x, prob) {
      if (is.unsorted(x)) x <- sort(x)
      n <- length(x)
      approx(seq(0, 1, length = n), x, prob)$y
      }
    

    Let's have a test:

    my_quantile(X, 0.6)
    # [1] 0.2343171
    
    quantile(X, prob = 0.6, names = FALSE)
    # [1] 0.2343171
    

    Note that result is different from what we get from X[which(e_cdf >= 0.6)[1]].

    It is for this reason, that I refuse to use quantile() in my answer.